Welcome to Single Variable Data!
In this chapter, we are going to learn how to take a big, messy pile of numbers and turn them into a story that makes sense. Whether it's the heights of students in your class or the daily rainfall in Cambridge, statistics helps us see the patterns. We will look at how to draw data, how to find the "middle" of it, and how to measure how "spread out" it is. Don't worry if you’ve struggled with stats before; we’ll take it one step at a time!
1. Representing Data Visually
Pictures are often easier to understand than lists of numbers. Depending on your data, different diagrams work better than others.
Stem-and-Leaf Diagrams
Think of this as a way to sort data while keeping the original numbers visible. The "stem" is the first digit(s), and the "leaf" is the last digit.
Example: If you have the numbers 21, 23, and 35:
The stem is 2, the leaves are 1 and 3.
The stem is 3, the leaf is 5.
Memory Trick: Just like a real plant, the leaves grow out from the stem. Always include a key (e.g., 2 | 1 means 21) so people know how to read your values!
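If you like to check your diagrams with a bit of code, here is a minimal Python sketch of the idea above. It assumes the data are whole numbers, with the tens as the stem and the units as the leaf; the function name stem_and_leaf is made up for this illustration.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group whole numbers by stem (tens) and leaf (units)."""
    rows = defaultdict(list)
    for v in sorted(values):          # sorting keeps the leaves in order
        stem, leaf = divmod(v, 10)    # e.g. 23 -> stem 2, leaf 3
        rows[stem].append(leaf)
    return dict(rows)

diagram = stem_and_leaf([21, 23, 35])
for stem, leaves in diagram.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
print("Key: 2 | 1 means 21")
```

For the three numbers in the example this prints one row for stem 2 (leaves 1 and 3) and one row for stem 3 (leaf 5), followed by the key.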
Box-and-Whisker Plots
These are great for showing the "spread" of data. They use five key numbers:
1. The minimum (lowest value)
2. The lower quartile (Q1) (25% of the way through)
3. The median (Q2) (the middle)
4. The upper quartile (Q3) (75% of the way through)
5. The maximum (highest value)
Analogy: Imagine people standing in a line, ordered from smallest to largest. The box covers the middle 50% of them. If the box is wide, the middle group is very varied. If it's narrow, they are all very similar.
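The five key numbers can be found with a short Python sketch. Be aware that there are several conventions for quartiles; this version assumes the "median of each half" method (leaving out the middle value when the count is odd), so your calculator or exam board may give slightly different quartiles. The data set and the function name are invented for illustration.

```python
from statistics import median

def five_number_summary(values):
    """Return min, Q1, median, Q3, max using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    lower = data[: n // 2]        # values below the median position
    upper = data[(n + 1) // 2 :]  # values above the median position
    return {
        "min": data[0],
        "Q1": median(lower),
        "median": median(data),
        "Q3": median(upper),
        "max": data[-1],
    }

summary = five_number_summary([2, 4, 5, 7, 8, 10, 13])
print(summary)
# {'min': 2, 'Q1': 4, 'median': 7, 'Q3': 10, 'max': 13}
```

The box would be drawn from 4 to 10 with a line at 7, and the whiskers would reach out to 2 and 13.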
Histograms
Histograms look like bar charts, but they have one very important rule: Area represents Frequency.
This is vital for data with different group sizes (class widths). Instead of just plotting "frequency" on the vertical axis, we plot Frequency Density.
\( \text{Frequency Density} = \frac{\text{Frequency}}{\text{Class Width}} \)
Common Mistake to Avoid: Never just draw the heights based on the frequency if the groups are different widths. Always calculate the density first!
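Here is a small Python sketch of the frequency density calculation. The class boundaries and frequencies are made-up numbers, and the function name is just for this illustration.

```python
def frequency_densities(classes):
    """classes is a list of (lower_bound, upper_bound, frequency) tuples."""
    densities = []
    for lower, upper, frequency in classes:
        class_width = upper - lower
        densities.append(frequency / class_width)
    return densities

# Three classes with different widths: 10, 5 and 15
classes = [(0, 10, 20), (10, 15, 15), (15, 30, 30)]
densities = frequency_densities(classes)
print(densities)  # [2.0, 3.0, 2.0]
```

Notice that the last class has the biggest frequency (30) but not the tallest bar. Its bar is wide, so a height of 2.0 still gives an area of 2.0 × 15 = 30.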
Quick Takeaway: Choose your diagram based on what you want to show. Use box plots to compare spreads and histograms to show the "shape" of the data distribution.
2. Measures of Central Tendency (The "Averages")
When we want to describe a whole group with just one number, we use an average.
The Mean (\( \bar{x} \))
The "fair share" average. You add everything up and divide by how many items there are.
\( \bar{x} = \frac{\sum x}{n} \)
Did you know? The mean is very sensitive to outliers (extreme values). If a billionaire walks into a room of students, the "mean" wealth of the room suddenly becomes millions, even though everyone else is still broke!
The Median
The literal middle. Line the numbers up from smallest to largest and find the center point.
Simple Trick: If you have an odd number of items, it's the middle one. If you have an even number, it's the halfway point between the two middle ones.
The Mode
The most common value. This is the only average you can use for non-numerical data (like "favourite colour").
Key Takeaway: Use the median if your data has extreme outliers, as it isn't "pulled" away from the center by one weirdly large or small number.
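To see the three averages side by side, here is a short Python sketch using the standard statistics module. The wealth figures are invented to mimic the billionaire example.

```python
from statistics import mean, median, mode

# Hypothetical savings (in pounds) of five students
wealth = [10, 20, 20, 30, 40]
print(mean(wealth), median(wealth), mode(wealth))  # 24 20 20

# A billionaire walks into the room
with_billionaire = wealth + [1_000_000_000]
print(mean(with_billionaire))    # roughly 166.7 million
print(median(with_billionaire))  # 25.0
print(mode(with_billionaire))    # 20
```

The mean jumps from 24 to over 166 million, while the median only moves from 20 to 25 and the mode does not change at all.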
3. Measures of Variation (The "Spread")
Knowing the average isn't enough. We need to know if the data is all huddled together or scattered all over the place.
Range and Inter-quartile Range (IQR)
Range: Maximum - Minimum. (Strongly affected by outliers, because it only uses the two most extreme values.)
IQR: \( Q_3 - Q_1 \). This tells you the spread of the middle 50% of the data. It's much more reliable because it ignores the extremes at the ends.
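The Python sketch below compares the range and the IQR on a made-up data set, then swaps the largest value for an extreme one. It assumes the "median of each half" method for quartiles, which is only one of several conventions, so other methods may give slightly different values.

```python
from statistics import median

def data_range(values):
    return max(values) - min(values)

def iqr(values):
    """Q3 - Q1, with quartiles taken as the medians of the lower and upper halves."""
    data = sorted(values)
    n = len(data)
    lower = data[: n // 2]
    upper = data[(n + 1) // 2 :]
    return median(upper) - median(lower)

typical = [2, 4, 5, 7, 8, 10, 13]
with_extreme = [2, 4, 5, 7, 8, 10, 130]  # 13 replaced by an extreme value

print(data_range(typical), iqr(typical))            # 11 6
print(data_range(with_extreme), iqr(with_extreme))  # 128 6
```

One extreme value changes the range from 11 to 128, but the IQR stays at 6.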
Variance and Standard Deviation (\( \sigma \))
Standard deviation is a measure of the typical distance of the data from the mean. If the standard deviation is small, the data is clustered close to the mean; if it is large, the data is more spread out.
The syllabus defines Standard Deviation as the root mean square deviation from the mean. You can use these formulas:
\( \sigma = \sqrt{\frac{\sum (x - \bar{x})^2}{n}} \) or the "computational" version: \( \sigma = \sqrt{\frac{\sum x^2}{n} - \bar{x}^2} \)
Variance is simply the standard deviation squared (\( \sigma^2 \)).
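If you want to convince yourself that the two formulas agree, here is a minimal Python sketch that works both out on a small made-up data set. The function names are invented for this illustration.

```python
from math import sqrt

def sd_definition(xs):
    """Root mean square deviation from the mean."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sqrt(sum((x - x_bar) ** 2 for x in xs) / n)

def sd_computational(xs):
    """Square root of (mean of the squares minus square of the mean)."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sqrt(sum(x * x for x in xs) / n - x_bar ** 2)

data = [2, 4, 4, 4, 5, 5, 7, 9]  # sum x = 40, sum x^2 = 232, n = 8
print(sd_definition(data))     # 2.0
print(sd_computational(data))  # 2.0
```

Here the mean is 5 and the variance is 232/8 - 25 = 4, so the standard deviation is 2. Python's built-in statistics.pstdev also divides by n and gives the same answer, whereas statistics.stdev divides by n - 1 and gives a slightly larger value.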
Don't worry if this seems tricky! Your calculator has a "Statistics mode" that can do most of this heavy lifting for you. Practice finding the \(\sum x\) and \(\sum x^2\) buttons!
Key Takeaway: Standard deviation uses every single piece of data, making it very powerful, but the IQR is better if your data is "messy" with lots of outliers.
4. Outliers and Cleaning Data
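An outlier is a value that lies a long way from the rest of the data. Sometimes an outlier is just wrong: maybe someone made a typo, or a sensor broke. Other times it is a genuine but unusual value.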
How to spot an Outlier
In your exam, you usually use one of two mathematical "fences" to find outliers:
1. The IQR Rule: Anything more than \( 1.5 \times IQR \) above \( Q_3 \) or below \( Q_1 \), i.e. greater than \( Q_3 + 1.5 \times IQR \) or less than \( Q_1 - 1.5 \times IQR \).
2. The Standard Deviation Rule: Anything more than \( 2 \times \sigma \) away from the mean.
Analogy: Think of these rules as "security guards." If a data point stands too far away from the group, the guard flags it for inspection!
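Both rules can be sketched in a few lines of Python. The data sets are made up, the quartiles, mean and standard deviation are assumed to have been found already, and the function names are invented for this illustration.

```python
def iqr_fences(q1, q3):
    """Lower and upper fences for the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def sd_fences(mean, sd):
    """Lower and upper fences for the 2 x standard deviation rule."""
    return mean - 2 * sd, mean + 2 * sd

def outliers(values, low, high):
    """Values strictly outside the fences."""
    return [v for v in values if v < low or v > high]

# IQR rule: assume Q1 = 4 and Q3 = 10 for this data
data = [2, 4, 5, 7, 8, 10, 30]
low, high = iqr_fences(q1=4, q3=10)
print(low, high)                   # -5.0 19.0
print(outliers(data, low, high))   # [30]

# Standard deviation rule: this data has mean 5 and standard deviation 2
data2 = [2, 4, 4, 4, 5, 5, 7, 9]
low2, high2 = sd_fences(mean=5, sd=2)
print(low2, high2)                    # 1 9
print(outliers(data2, low2, high2))   # []
```

In the second data set, 9 sits exactly on the upper fence. It is not more than two standard deviations from the mean, so it is not flagged.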
Cleaning Data
When you find an outlier, you have to decide: Is it a genuine (but weird) piece of data, or is it an error? Cleaning data involves identifying these errors and deciding whether to remove them or correct them before you start your calculations.
Quick Review:
• Check for typos.
• Calculate the "fences" using \( 1.5 \times IQR \) (or \( 2 \times \sigma \) from the mean).
• Decide if the outlier stays or goes based on the context.
Summary of Key Terms
Population: The whole group you are interested in.
Sample: A small part of the population you actually measure.
Frequency Density: The height of a histogram bar, found by dividing the frequency by the class width.
Central Tendency: A fancy way of saying "averages" (mean, median, mode).
Variation: A fancy way of saying "spread" (IQR, standard deviation).
Estimates: When data is grouped (e.g., "10-20 minutes"), we don't know the exact values, so our calculated mean is only an estimate based on the midpoint of the groups.
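As a rough illustration of estimating the mean from grouped data, here is a minimal Python sketch. The groups and frequencies are made up, and the function name is invented for this example.

```python
def estimated_mean(classes):
    """classes is a list of (lower_bound, upper_bound, frequency) tuples."""
    total_fx = 0
    total_f = 0
    for lower, upper, frequency in classes:
        midpoint = (lower + upper) / 2   # stands in for every value in the group
        total_fx += midpoint * frequency
        total_f += frequency
    return total_fx / total_f

# Hypothetical journey times in minutes
classes = [(0, 10, 4), (10, 20, 10), (20, 30, 6)]
estimate = estimated_mean(classes)
print(estimate)  # 16.0
```

The midpoints are 5, 15 and 25, so the estimate is (5 × 4 + 15 × 10 + 25 × 6) ÷ 20 = 320 ÷ 20 = 16. It is only an estimate because we are pretending that every value sits at the midpoint of its group.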