Introduction: Why Summarise Data?

Imagine you’re looking at the heights of 1,000 students. If someone asks you, "How tall are the students at your school?", you wouldn't read out every single measurement! Instead, you’d use a couple of numbers to describe the whole group. This is where measures of average (to show the centre) and measures of spread (to show the variety) come in. These tools help us compare different data sets and make sense of the world, from weather patterns to exam results.

1. Measures of Average (Central Tendency)

An "average" is a single value that attempts to describe a set of data by identifying the central position within that data. In the H240 syllabus, you need to be comfortable with three main types.

The Mean (\(\bar{x}\))

The mean is what most people mean when they say "average." You find it by adding all the values together and dividing by how many there are.

The Formula: \(\bar{x} = \frac{\sum x}{n}\)
Where \(\sum x\) means the "sum of all values" and \(n\) is the number of values.

Pros: It uses every single piece of data.
Cons: It can be "pulled" away from the centre by one or two very high or very low numbers (outliers).
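
Worked Example (Python sketch): The formula and the outlier warning above can be checked with a few lines of code. The heights below are made up for illustration, and the function name mean is just a label chosen here.

```python
# Hypothetical heights in cm (invented for this example).
heights = [150, 152, 155, 158, 160]


def mean(values):
    """Sum of the values divided by how many there are."""
    return sum(values) / len(values)


mean_without_outlier = mean(heights)        # 775 / 5
mean_with_outlier = mean(heights + [215])   # one very tall student added: 990 / 6

print(mean_without_outlier)  # 155.0
print(mean_with_outlier)     # 165.0
```

Notice that a single extreme value lifts the mean to 165, which is taller than every one of the original five students.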

The Median

The median is the middle value when the data is listed in order. If you have an even number of values, the median is the mean of the two middle ones.

Analogy: Think of the "median" strip in the middle of a road. It splits the road in half, just as the median splits the ordered data in half.
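
Worked Example (Python sketch): One way to express the "sort, then take the middle" rule in code, covering both the odd and the even case. The numbers are invented for illustration.

```python
def median(values):
    """Middle value of the ordered data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


odd_median = median([7, 3, 9, 1, 5])        # ordered: 1 3 5 7 9
even_median = median([7, 3, 9, 1, 5, 11])   # ordered: 1 3 5 7 9 11

print(odd_median)   # 5
print(even_median)  # 6.0
```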

The Mode

The mode is the value that appears most often. A data set can have more than one mode (with exactly two modes it is called bimodal) or no mode at all if all values are unique.
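
Worked Example (Python sketch): A minimal way to find the mode(s) by counting how often each value appears. The helper name modes and the choice to return an empty list when every value is unique are assumptions made for this sketch, and it expects a non-empty list.

```python
from collections import Counter


def modes(values):
    """Return a sorted list of the most frequent values ([] if all are unique)."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # every value appears once, so there is no mode
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([4, 5, 5, 6]))     # [5]
print(modes([1, 1, 2, 3, 3]))  # [1, 3]  (bimodal)
print(modes([1, 2, 3]))        # []
```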

Quick Review:
- Mean: The "Leveler" (shares the total equally).
- Median: The "Middle" (50% above, 50% below).
- Mode: The "Popular" (the most frequent).

2. Measures of Spread (Variation)

The average tells us where the centre is, but the spread tells us how consistent the data is. Are the heights all very similar, or is there a huge difference between the shortest and tallest student?

Quartiles and the Inter-Quartile Range (IQR)

Just like the median splits data into two halves, quartiles split the data into four quarters.

- Lower Quartile (\(Q_1\)): The 25th percentile (one-quarter of the way through).
- Upper Quartile (\(Q_3\)): The 75th percentile (three-quarters of the way through).
- Inter-Quartile Range (IQR): \(Q_3 - Q_1\).

Why use IQR? Unlike the full range, the IQR ignores the extreme ends of the data and focuses on the middle 50%, so one or two outliers barely affect it.
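
Worked Example (Python sketch): The code below finds the quartiles by taking the median of the lower half and the median of the upper half (leaving out the middle value when n is odd). Quartile conventions vary between textbooks, calculators and software, so treat this method as an assumption: your calculator may give slightly different values. The data set is invented.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method (an assumed convention)."""
    ordered = sorted(values)
    n = len(ordered)
    lower_half = ordered[: n // 2]          # values below the middle
    upper_half = ordered[(n + 1) // 2:]     # values above the middle
    return median(lower_half), median(ordered), median(upper_half)


data = [2, 4, 5, 7, 8, 10, 11, 13, 40]      # 40 is a deliberate extreme value
q1, q2, q3 = quartiles(data)
iqr = q3 - q1
data_range = max(data) - min(data)

print(q1, q2, q3)   # 4.5 8 12.0
print(iqr)          # 7.5
print(data_range)   # 38
```

The range (38) is dominated by the single value 40, while the IQR (7.5) describes only the middle half of the data.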

Variance and Standard Deviation (\(\sigma\))

The Standard Deviation is the "Gold Standard" of spread in A Level Maths. It measures roughly how far, on average, the data points lie from the mean.

The "Step-by-Step" for Standard Deviation:
Don't worry if this seems tricky! Your calculator will do most of the heavy lifting, but you must understand the formula:

\(\sigma = \sqrt{\frac{\sum x^2}{n} - \bar{x}^2}\)

Memory Aid: A common way to remember this is "The Square Root of (The Mean of the Squares minus the Square of the Mean)."

The Variance: This is simply the Standard Deviation squared (\(\sigma^2\)). It’s the value before you take the final square root.

Common Mistake: Forgetting to take the square root at the end of the calculation! If your answer for spread looks massive compared to your data, you probably found the variance instead of the standard deviation.
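
Worked Example (Python sketch): This follows the memory aid step by step on a small invented data set, then compares the result with pstdev from Python's standard statistics module, which calculates the same "divide by n" standard deviation.

```python
from math import sqrt
from statistics import pstdev

data = [2, 4, 4, 4, 5, 5, 7, 9]   # invented values
n = len(data)

mean = sum(data) / n                              # 40 / 8
mean_of_squares = sum(x ** 2 for x in data) / n   # 232 / 8
variance = mean_of_squares - mean ** 2            # mean of squares minus square of mean
sd = sqrt(variance)                               # don't forget the square root!

print(mean)       # 5.0
print(variance)   # 4.0
print(sd)         # 2.0
print(abs(sd - pstdev(data)) < 1e-12)  # True: matches the library value
```

Here the variance (4) is larger than the standard deviation (2), which is the usual sign that the square root was skipped.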

3. Working with Grouped Data

Sometimes data is given in classes (e.g., "5 students are between 140 cm and 150 cm"). Because we don't know the exact heights, we use the midpoint of each class as an estimate for \(x\).

The Estimated Mean: \(\bar{x} \approx \frac{\sum fx}{\sum f}\)
(Multiply each frequency \(f\) by its midpoint \(x\), sum them up, and divide by the total frequency).

The Estimated Standard Deviation: \(\sigma \approx \sqrt{\frac{\sum fx^2}{\sum f} - \bar{x}^2}\)

Did you know? Because we use midpoints, any calculation from grouped data is always an estimate, not an exact value.
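
Worked Example (Python sketch): The two estimates above applied to a small grouped table. The class boundaries and frequencies are invented for illustration.

```python
from math import sqrt

# Hypothetical grouped heights in cm: (lower bound, upper bound, frequency).
classes = [(140, 150, 5), (150, 160, 10), (160, 170, 10), (170, 180, 5)]

total_f = 0     # running total of f
sum_fx = 0.0    # running total of f * x
sum_fx2 = 0.0   # running total of f * x^2

for lower, upper, f in classes:
    x = (lower + upper) / 2   # the midpoint stands in for the unknown exact heights
    total_f += f
    sum_fx += f * x
    sum_fx2 += f * x ** 2

est_mean = sum_fx / total_f
est_variance = sum_fx2 / total_f - est_mean ** 2
est_sd = sqrt(est_variance)

print(est_mean)          # 160.0
print(round(est_sd, 2))  # 9.57
```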

4. Outliers and Cleaning Data

An outlier is a data point that is so far away from the others that it might be an error or a very rare case. In the H240 syllabus, there are two common rules for identifying them:

1. The IQR Rule: Any value greater than \(Q_3 + 1.5 \times \text{IQR}\) or less than \(Q_1 - 1.5 \times \text{IQR}\).
2. The SD Rule: Any value more than \(2\) standard deviations away from the mean.

Data Cleaning: This involves deciding whether to remove outliers (if they are errors) or keep them (if they are genuine extreme cases).
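
Worked Example (Python sketch): Both rules applied to the same invented data set. The quartiles use the median-of-halves method, which is an assumed convention (your calculator may differ slightly), and the function names are hypothetical labels chosen for this sketch.

```python
from statistics import mean, median, pstdev


def iqr_outliers(values):
    """Values beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR (median-of-halves quartiles)."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = median(ordered[: n // 2])
    q3 = median(ordered[(n + 1) // 2:])
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in ordered if x < low or x > high]


def sd_outliers(values):
    """Values more than 2 standard deviations away from the mean."""
    centre = mean(values)
    sd = pstdev(values)
    return [x for x in sorted(values) if abs(x - centre) > 2 * sd]


data = [2, 4, 5, 7, 8, 10, 11, 13, 40]

print(iqr_outliers(data))  # [40]
print(sd_outliers(data))   # [40]
```

The code only flags the value 40. Whether to remove it is still a judgement about whether it is an error or a genuine extreme case.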

5. Comparing Distributions

When an exam question asks you to "compare two distributions," you must mention two things in context:

1. Compare the Average: Use the mean or median. (e.g., "Class A had a higher median score than Class B, suggesting they performed better on average.")
2. Compare the Spread: Use the Standard Deviation or IQR. (e.g., "Class B had a lower Standard Deviation, meaning their results were more consistent.")

Key Takeaway:
- High Spread = Inconsistent data (values vary a lot).
- Low Spread = Consistent data (values are close together).
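
Worked Example (Python sketch): A small illustration of making both comparisons at once. The two sets of exam scores are invented, and the wording of the printed sentences is only a suggestion for how a comparison in context might read.

```python
from statistics import mean, pstdev

# Hypothetical exam scores for two classes.
class_a = [45, 52, 58, 63, 70, 77, 85]
class_b = [55, 57, 59, 60, 62, 64, 63]

mean_a, mean_b = mean(class_a), mean(class_b)
sd_a, sd_b = pstdev(class_a), pstdev(class_b)

# 1. Compare the average.
higher = "A" if mean_a > mean_b else "B"
average_comment = f"Class {higher} had the higher mean score, so performed better on average."

# 2. Compare the spread.
steadier = "A" if sd_a < sd_b else "B"
spread_comment = f"Class {steadier} had the lower standard deviation, so its scores were more consistent."

print(average_comment)
print(spread_comment)
```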

Summary Checklist

- Can you calculate the mean and standard deviation on your calculator? (Check your manual!)
- Do you know the difference between variance and standard deviation?
- Can you identify an outlier using the \(1.5 \times \text{IQR}\) rule?
- When comparing data, have you mentioned both an average and a spread?