Welcome to Representation and Summary of Data!

Ever looked at a massive pile of numbers and felt your head spin? Don't worry, we've all been there! This chapter is your toolkit for turning "data chaos" into clear, meaningful stories. Whether you are looking at test scores, weather patterns, or sports stats, the techniques you learn here will help you visualize and summarize information like a pro.

Why is this important? In the real world, "Big Data" is everywhere. Companies use these exact methods to decide what products to sell, and doctors use them to understand how well a new medicine works. By the end of this section, you'll be able to "speak" the language of data!


1. Visualizing Data: Pictures That Tell a Story

Sometimes, a graph is worth a thousand numbers. We use three main types of diagrams in S1 to see the "shape" of our data.

A. Stem and Leaf Diagrams

Think of this as a way to organize numbers into a "bookshelf." The "stem" is like the shelf category (e.g., tens), and the "leaves" are the individual items (e.g., units).

Key Point: Always include a Key! For example: Key: 2 | 5 means 25. Without a key, your diagram is just a bunch of mysterious numbers.

Real-world Analogy: Imagine sorting your laundry into piles by color (the stems). Each individual item in a pile, whether a sock, a shirt, or a pair of pants, is a leaf.
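
To see the "bookshelf" idea in action, here is a minimal Python sketch that builds a stem and leaf diagram for two-digit whole numbers. The scores and the function name `stem_and_leaf` are made up for illustration.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group whole numbers by tens digit (stem), with the units as sorted leaves."""
    rows = defaultdict(list)
    for v in sorted(values):
        rows[v // 10].append(v % 10)  # stem = tens, leaf = units
    return dict(rows)

scores = [25, 31, 22, 47, 38, 25, 40, 33]  # made-up test scores
diagram = stem_and_leaf(scores)

for stem in sorted(diagram):
    leaves = " ".join(str(leaf) for leaf in diagram[stem])
    print(f"{stem} | {leaves}")
print("Key: 2 | 5 means 25")  # never forget the key!
```

Notice that the leaves come out in order on each row, which is what an examiner would expect in an ordered diagram.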

B. Box Plots (Box-and-Whisker)

These are fantastic for comparing two sets of data side-by-side (like the heights of two different basketball teams). A box plot shows five key numbers:

  1. The Minimum value.
  2. The Lower Quartile (\(Q_1\)): The 25% mark.
  3. The Median (\(Q_2\)): The middle value (50% mark).
  4. The Upper Quartile (\(Q_3\)): The 75% mark.
  5. The Maximum value.
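
The five numbers can be pulled out of a list with a short Python sketch. Be careful: there are several conventions for locating quartiles. The rule assumed here (work out the position as a fraction of \(n\); if it is a whole number, go halfway to the next value, otherwise round up) is one common textbook choice, so check which rule your own course uses. The heights and function names are made up for illustration.

```python
import math

def quartile(sorted_values, fraction):
    """Value at the given fraction (0.25, 0.5 or 0.75) of an ordered list."""
    position = len(sorted_values) * fraction
    if position == int(position):
        # Whole number: take the midpoint of this item and the next one.
        k = int(position)
        return (sorted_values[k - 1] + sorted_values[k]) / 2
    # Not a whole number: round up to the next item.
    return sorted_values[math.ceil(position) - 1]

def five_number_summary(values):
    ordered = sorted(values)
    return (
        ordered[0],               # minimum
        quartile(ordered, 0.25),  # Q1
        quartile(ordered, 0.5),   # Q2 (median)
        quartile(ordered, 0.75),  # Q3
        ordered[-1],              # maximum
    )

heights = [150, 152, 155, 160, 161, 165, 170, 172]  # made-up heights in cm
print(five_number_summary(heights))  # (150, 153.5, 160.5, 167.5, 172)
```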

C. Histograms

Histograms are special. Unlike a bar chart, where the height of a bar shows the frequency, in a histogram it is the area of each bar that is proportional to the frequency. We use them for continuous data (things we measure, like time or weight).

The Golden Formula: \( \text{Frequency Density} = \frac{\text{Frequency}}{\text{Class Width}} \)

Common Mistake to Avoid: Never just plot frequency on the vertical axis if the class widths are different. Always calculate the Frequency Density first!
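
Here is a small Python sketch of the Golden Formula applied to a table with unequal class widths. The classes and frequencies are made up for illustration.

```python
# Each class is (lower boundary, upper boundary, frequency); made-up data.
classes = [(0, 10, 8), (10, 20, 15), (20, 40, 12), (40, 70, 6)]

def frequency_densities(classes):
    """Frequency density = frequency / class width, for each class."""
    return [freq / (upper - lower) for lower, upper, freq in classes]

densities = frequency_densities(classes)
for (lower, upper, freq), density in zip(classes, densities):
    width = upper - lower
    print(f"{lower}-{upper}: width {width}, frequency {freq}, density {density}")
```

The last class has the smallest density (0.2) even though its frequency is not far below the first class, because its bar is three times as wide. Multiplying each density by its class width gives back the frequency, which is exactly the "area = frequency" idea.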

Quick Review:
- Stem and Leaf: Best for seeing every individual value.
- Box Plots: Best for comparing the spread and medians.
- Histograms: Best for showing the distribution of measured data.


2. Measures of Location: Finding the "Center"

Where does the data "sit"? We use three main "averages" to find out.

The Big Three: Mode, Median, and Mean

  • Mode: The value that appears most often. (Fashionable = Most popular!).
  • Median (\(Q_2\)): The middle value when numbers are in order. If there are \(n\) items, the position is \(\frac{n+1}{2}\). If that position ends in .5, take the value halfway between the two items on either side.
  • Mean (\(\bar{x}\)): The "fair share" average. Sum everything up and divide by how many there are.
    Formula: \( \bar{x} = \frac{\sum x}{n} \) or for grouped data: \( \bar{x} = \frac{\sum fx}{\sum f} \).
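
The Big Three are all available in Python's built-in `statistics` module, so you can use it to check your working. The sketch below also applies the grouped-data formula by hand. All of the numbers, and the helper name `grouped_mean`, are made up for illustration.

```python
from statistics import mean, median, mode

marks = [4, 7, 7, 8, 10, 12]  # made-up list of marks

print(mode(marks))    # 7: the most common value
print(median(marks))  # 7.5: position (6 + 1) / 2 = 3.5, halfway between the 3rd and 4th values
print(mean(marks))    # 8: the total, 48, shared between 6 marks

# Grouped data: each value x occurs f times (made-up frequency table).
x_values = [1, 2, 3, 4]
frequencies = [3, 5, 8, 4]

def grouped_mean(xs, fs):
    """Mean from a frequency table: (sum of f*x) / (sum of f)."""
    sum_fx = sum(f * x for x, f in zip(xs, fs))
    return sum_fx / sum(fs)

print(grouped_mean(x_values, frequencies))  # 53 / 20
```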

Understanding Coding

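Don't worry if this seems tricky at first! Coding is just a way to make big numbers smaller and easier to work with. We "code" data using a formula like \( y = \frac{x - a}{b} \), where \(a\) is a number subtracted from every value and \(b\) is a number every result is divided by.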

Simple Trick:
- If you add or subtract a number from every value, the mean goes up or down by that same amount.
- If you multiply or divide every value by a number, the mean is multiplied or divided by that same number.

Example: If the average temperature is 20°C and we add 5° to every reading, the new mean is 25°C. Simple!
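
The two rules mean you can find the mean of the small coded numbers and then undo the coding with \( \bar{x} = a + b\bar{y} \). A minimal Python sketch, assuming made-up readings and the made-up choices \(a = 1000\) and \(b = 10\):

```python
def code(values, a, b):
    """Apply the coding y = (x - a) / b to every value."""
    return [(x - a) / b for x in values]

def mean(values):
    return sum(values) / len(values)

readings = [1010, 1020, 1030, 1050]  # made-up "big" numbers
a, b = 1000, 10                      # chosen to make the numbers small

coded = code(readings, a, b)         # [1.0, 2.0, 3.0, 5.0]
coded_mean = mean(coded)             # 2.75
recovered_mean = a + b * coded_mean  # undo the coding: multiply by b, then add a

print(coded_mean, recovered_mean, mean(readings))
```

The recovered mean matches the mean of the original readings (1027.5), but the arithmetic only ever involved the numbers 1, 2, 3 and 5.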

Takeaway: The mean is sensitive to every single number in the data set, while the median only cares about the middle position.


3. Measures of Dispersion: How "Spread Out" is it?

Two sets of data might have the same mean but look totally different. Dispersion tells us if the data is tightly packed or all over the place.

Range and Interquartile Range (IQR)

  • Range: Max minus Min. Easy, but easily ruined by one weirdly large or small number (outliers).
  • Interquartile Range: \( \text{IQR} = Q_3 - Q_1 \). This tells us the spread of the middle 50% of the data. It's much more reliable because it ignores those "weird" numbers at the ends!
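
To see why the IQR is more reliable, this Python sketch compares the two measures on a made-up data set before and after one value is swapped for a "weird" one. It uses `statistics.quantiles` (Python 3.8+) with `method="inclusive"`; that function interpolates between values, so on other data it may give slightly different quartiles from the rule you use by hand.

```python
from statistics import quantiles

def data_range(values):
    return max(values) - min(values)

def iqr(values):
    """Q3 - Q1, using the library's 'inclusive' quartile method (an assumption)."""
    q1, _q2, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1

normal = [12, 14, 15, 16, 18, 19, 21, 22, 24]        # made-up data
with_outlier = [12, 14, 15, 16, 18, 19, 21, 22, 90]  # 24 replaced by 90

print(data_range(normal), iqr(normal))              # range 12, IQR 6
print(data_range(with_outlier), iqr(with_outlier))  # range 78, IQR still 6
```

One extreme value makes the range more than six times bigger, while the IQR does not move at all.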

Variance and Standard Deviation

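These are the "heavy hitters" of statistics. The variance is the average of the squared distances of the data points from the mean, and the standard deviation (its square root) tells us roughly how far a typical data point lies from the mean.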

  • Variance (\(\sigma^2\)): The "mean of the squares minus the square of the mean."
    Formula: \( \sigma^2 = \frac{\sum x^2}{n} - (\bar{x})^2 \)
  • Standard Deviation (\(\sigma\)): Just the square root of the variance!
    Formula: \( \sigma = \sqrt{\text{Variance}} \)

Memory Aid: For Variance, remember "MS-SM" (Mean of the Squares minus Square of the Mean). It's a lifesaver in exams!
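
"MS-SM" translates almost word for word into Python. This is a sketch on made-up data; the result is compared with `statistics.pvariance`, which also divides by \(n\) and so should agree with the formula above.

```python
from math import sqrt
from statistics import pvariance

def variance(values):
    """Mean of the squares minus the square of the mean."""
    n = len(values)
    mean_of_squares = sum(x * x for x in values) / n
    square_of_mean = (sum(values) / n) ** 2
    return mean_of_squares - square_of_mean

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data: sum = 40, sum of squares = 232

var = variance(data)  # 232 / 8 - 5**2 = 29 - 25 = 4
sd = sqrt(var)        # 2

print(var, sd, pvariance(data))
```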

Did you know? Standard deviation is used in finance to measure "risk." A high standard deviation in a stock price means it's a "bumpy ride" (high risk)!


4. Skewness and Outliers

Now we look at the "personality" of the data. Is it balanced, or does it lean to one side?

Skewness

Imagine the data distribution as a hill.
- Positive Skew: The "tail" points to the right (positive direction). Most data is bunched at the low end, and the tail drags the mean up, so typically Mean > Median > Mode. (Think: A few very rich people in a poor neighborhood).
- Negative Skew: The "tail" points to the left (negative direction). Most data is bunched at the high end, and the tail drags the mean down, so typically Mean < Median < Mode. (Think: An easy exam where most students get high marks).
- Symmetrical: The two sides of the hill are mirror images, like a bell. Mean \(\approx\) Median \(\approx\) Mode.
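
A quick way to guess the direction of skew is to compare the mean with the median, since the tail pulls the mean towards it. The Python sketch below does only that. It is a rough rule of thumb rather than a formal test, and the data sets and the function name `describe_skew` are made up for illustration.

```python
from statistics import mean, median

def describe_skew(values):
    """Rough skew check: the mean is dragged towards the tail."""
    m, med = mean(values), median(values)
    if m > med:
        return "positive"
    if m < med:
        return "negative"
    return "roughly symmetrical"

incomes = [20, 22, 23, 25, 26, 28, 95]     # one very large value (tail to the right)
exam_marks = [35, 78, 80, 82, 85, 88, 90]  # one very low mark (tail to the left)
balanced = [1, 2, 3, 4, 5]

print(describe_skew(incomes))     # positive
print(describe_skew(exam_marks))  # negative
print(describe_skew(balanced))    # roughly symmetrical
```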

Outliers

An outlier is a "rebel" data point that is much larger or smaller than the rest. In your exam, you'll be given a rule to find them.
Common Rule: Any value more than \( 1.5 \times \text{IQR} \) above \(Q_3\) or below \(Q_1\).
Step-by-step:
1. Calculate \( \text{IQR} = Q_3 - Q_1 \).
2. Multiply it by 1.5.
3. Add that to \(Q_3\) (upper boundary) and subtract it from \(Q_1\) (lower boundary).
4. Anything outside these boundaries is an outlier!
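
The four steps above can be sketched in Python as follows. The data is made up, and the quartiles are assumed to have been found already by whichever method your course uses.

```python
def outlier_boundaries(q1, q3):
    """Steps 1-3: IQR, multiply by 1.5, then step outwards from Q1 and Q3."""
    step = 1.5 * (q3 - q1)
    return q1 - step, q3 + step

def find_outliers(values, q1, q3):
    """Step 4: keep anything outside the two boundaries."""
    lower, upper = outlier_boundaries(q1, q3)
    return [v for v in values if v < lower or v > upper]

data = [12, 14, 15, 16, 18, 19, 21, 22, 90]  # made-up data
q1, q3 = 15, 21                              # assumed quartiles, so IQR = 6

print(outlier_boundaries(q1, q3))   # (6.0, 30.0)
print(find_outliers(data, q1, q3))  # [90]
```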

Key Takeaway: Always check for outliers before drawing your box plot whiskers! Whiskers usually stop at the last "normal" data point, and outliers are marked with an 'x'.


Final Summary Checklist

Before you move on, make sure you can:
- Calculate Frequency Density for Histograms.
- Find the Median and Quartiles from a list or table.
- Use the "Mean of the squares minus square of the mean" for Variance.
- Identify Outliers using the \(1.5 \times \text{IQR}\) rule.
- Explain if data is Positively or Negatively skewed.

You've got this! Practice a few questions on coding and histograms, as those are the "trickiest" parts of this chapter. Good luck!