Welcome to Summary Measures!

Ever looked at a massive spreadsheet of numbers and felt your head spin? That's where summary measures come to the rescue! In this chapter, we learn how to take a giant pile of data and boil it down into just a few numbers that tell us a story. We focus on two main things: where the "middle" of the data is (central tendency) and how spread out the data is (variation).

Don't worry if Statistics feels a bit "wordy" compared to Pure Maths. It's all about interpretation. By the end of these notes, you'll be able to describe any dataset like a pro.


1. Measures of Central Tendency: Finding the "Middle"

We use these to find a single value that represents the "typical" result in a dataset. The syllabus focuses on four main types:

The Mean (\(\bar{x}\))

The arithmetic mean is what most people call "the average." You add everything up and divide by how many items there are.

The Formula: \(\bar{x} = \frac{\sum x}{n}\)

Example: If five students score 10, 12, 15, 18, and 20 on a test, the mean is \(\frac{75}{5} = 15\).
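If you like checking your arithmetic on a computer, here is a minimal Python sketch of the same calculation. The variable names are just illustrative choices.

```python
scores = [10, 12, 15, 18, 20]  # the five test scores from the example

total = sum(scores)   # sum of x
n = len(scores)       # number of values
mean = total / n      # x-bar = (sum of x) / n

print(total, n, mean)  # 75 5 15.0
```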

Weighted Mean: Sometimes some values should count for more than others. For example, if you are combining the mean heights of people in two different cities into one overall mean, you must "weight" each city's mean by its population: \(\bar{x} = \frac{\sum wx}{\sum w}\), where each \(w\) is a weight.
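To see why the weighting matters, here is a small sketch with made-up numbers: the city names, heights, and populations are all hypothetical. A weighted mean is the sum of (weight × value) divided by the sum of the weights.

```python
# Hypothetical data: mean height (cm) and population for two cities.
mean_heights = [170, 160]   # City A, City B
populations = [1000, 3000]  # the weights

weighted_sum = sum(w * x for w, x in zip(populations, mean_heights))
weighted_mean = weighted_sum / sum(populations)

# Ignoring the populations treats both cities as equally large.
unweighted_mean = sum(mean_heights) / len(mean_heights)

print(weighted_mean)    # 162.5
print(unweighted_mean)  # 165.0
```

The weighted mean sits closer to City B's value because City B has three times as many people.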

The Median

The median is the middle value when the data is written in order. If there is an even number of values, it's the average of the two middle ones.

Analogy: Think of the median strip on a highway—it's exactly in the middle, splitting the road in half!
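The two cases (odd and even number of values) can be sketched in Python like this. The function name `median` and the second dataset are assumptions made for illustration.

```python
def median(values):
    """Middle value of the ordered data (mean of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd n: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: average the middle pair

print(median([20, 10, 15, 18, 12]))      # 15
print(median([10, 12, 15, 18, 20, 22]))  # 16.5
```

Notice that the data is sorted first. Forgetting to put the values in order is the most common slip when finding a median by hand.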

The Mode

The mode is the most frequently occurring value. A dataset can have one mode, no mode, or two modes (bimodal).
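Here is a rough sketch of those three cases. The helper name `modes` is hypothetical, and it assumes the convention that a dataset where every value appears exactly once has no mode.

```python
from collections import Counter

def modes(values):
    """Return a sorted list of the most frequent values (empty if nothing repeats)."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # assumption: no repeated value means "no mode"
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([3, 5, 5, 7]))     # [5]     one mode
print(modes([1, 1, 2, 2, 3]))  # [1, 2]  bimodal
print(modes([1, 2, 3]))        # []      no mode
```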

The Midrange

This is a quick-and-dirty measure found by averaging the highest and lowest values: \(\frac{\text{max} + \text{min}}{2}\). It's simple but very sensitive to extreme values.
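A quick sketch shows just how sensitive the midrange is. The second list is a made-up variation of the test scores with one extreme value swapped in.

```python
def midrange(values):
    """Average of the largest and smallest values."""
    return (max(values) + min(values)) / 2

scores = [10, 12, 15, 18, 20]
with_extreme = [10, 12, 15, 18, 100]  # hypothetical: top score replaced by 100

print(midrange(scores))        # 15.0
print(midrange(with_extreme))  # 55.0
```

Changing a single value moved the midrange from 15 to 55, even though the middle value of the data is still 15.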

Quick Review: Which one should I use?
Mean: Best for symmetrical data without big outliers.
Median: Best if the data is "skewed" (has a few very high or very low values) because it isn't pulled away by outliers.
Mode: Best for categorical data (e.g., "What is the most popular car color?").


2. Measures of Spread: How "Stretched" is the Data?

Two datasets can have the same mean but look totally different. Imagine two archers who each shoot 5 arrows: Archer A hits the bullseye 5 times. Archer B hits 1 meter to the left twice, the bullseye once, and 1 meter to the right twice. Their "average" position is the same (the bullseye), but Archer A is much more consistent!

Range

The simplest measure: \(\text{Highest value} - \text{Lowest value}\).
Common Mistake: The range is a single number (e.g., "15"), not two numbers (e.g., "10 to 25").
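As a tiny sketch, using made-up data that runs from 10 to 25:

```python
data = [10, 14, 17, 21, 25]  # hypothetical values

data_range = max(data) - min(data)  # highest value minus lowest value

print(data_range)  # 15
```

The answer is the single number 15, not the pair "10 to 25".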

Quartiles and the Interquartile Range (IQR)

We can split our data into four equal quarters:
Lower Quartile (\(Q_1\)): The 25th percentile (one-quarter of the way in).
Median (\(Q_2\)): The 50th percentile (halfway).
Upper Quartile (\(Q_3\)): The 75th percentile (three-quarters of the way in).
IQR: This is \(Q_3 - Q_1\). It tells us the spread of the middle 50% of the data.

Did you know? The IQR is great because it ignores the extreme "weird" values at the ends and focuses on the consistent middle.
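Here is one way to sketch the quartiles in Python. It assumes the "median of each half" convention (leaving out the middle value when \(n\) is odd); calculators, textbooks, and software sometimes use slightly different conventions, so small differences in \(Q_1\) and \(Q_3\) are possible. The function names and the dataset are illustrative only.

```python
def median(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                                        # values below the median
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]  # values above the median
    return median(lower), median(ordered), median(upper)

data = [3, 5, 7, 8, 9, 11, 13, 15, 18]  # hypothetical data, n = 9
q1, q2, q3 = quartiles(data)
iqr = q3 - q1

print(q1, q2, q3, iqr)  # 6.0 9 14.0 8.0
```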

Standard Deviation (\(s\)) and Variance (\(s^2\))

The standard deviation is the most widely used measure of spread. Roughly speaking, it tells us the typical distance of the data points from the mean.

Crucial Note for OCR MEI H640: We calculate the sample standard deviation. This means when we calculate the "average" of the squared distances, we divide by \(n-1\) instead of \(n\). Dividing by \(n\) tends to underestimate the spread of the whole population, and using \(n-1\) corrects for this.

The Formulas:
1. Sum of squares (\(S_{xx}\)): \(S_{xx} = \sum (x_i - \bar{x})^2\)
2. Sample Variance (\(s^2\)): \(s^2 = \frac{S_{xx}}{n-1}\)
3. Sample Standard Deviation (\(s\)): \(s = \sqrt{\frac{S_{xx}}{n-1}}\)
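The three formulas can be followed step by step in a short sketch. It reuses the five test scores (10, 12, 15, 18, 20); the variable names are just illustrative.

```python
import math

scores = [10, 12, 15, 18, 20]
n = len(scores)
mean = sum(scores) / n  # x-bar = 15.0

# Step 1: sum of squared deviations from the mean: 25 + 9 + 0 + 9 + 25
s_xx = sum((x - mean) ** 2 for x in scores)

# Step 2: sample variance, dividing by n - 1
variance = s_xx / (n - 1)

# Step 3: sample standard deviation
std_dev = math.sqrt(variance)

print(s_xx, variance, round(std_dev, 3))  # 68.0 17.0 4.123
```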

Calculator Tip: Don't panic about the long formula! Your scientific or graphical calculator has a Statistics Mode. You just enter the data, and it will give you \(\bar{x}\) and \(s\) automatically. Look for the \(s_x\) or \(\sigma_{n-1}\) symbol on your screen, not \(\sigma_x\) or \(\sigma_n\), which divide by \(n\).
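Software makes the same distinction as your calculator. As a sketch, Python's built-in `statistics` module offers `stdev` (divides by \(n-1\)) and `pstdev` (divides by \(n\)), so you can see how far apart the two answers are for the five test scores.

```python
import statistics

scores = [10, 12, 15, 18, 20]

s = statistics.stdev(scores)       # sample version: divides by n - 1
sigma = statistics.pstdev(scores)  # population version: divides by n

print(round(s, 3))      # 4.123
print(round(sigma, 3))  # 3.688
```

The \(n-1\) version is always the larger of the two, and the gap is most noticeable for small datasets like this one.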


3. Identifying Outliers and Cleaning Data

An outlier is a data point that looks inconsistent with the rest of the set, like a 7-foot-tall student in a primary school class.

How to spot an outlier (The Rules)

In the MEI syllabus, there are two standard ways to mathematically define an outlier. A question will usually tell you which one to use:

Rule 1 (The 1.5 × IQR Rule):
Anything smaller than \(Q_1 - (1.5 \times \text{IQR})\)
OR
Anything larger than \(Q_3 + (1.5 \times \text{IQR})\)
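Rule 1 can be sketched as a short function. It assumes you have already found \(Q_1\) and \(Q_3\) (for example, from your calculator); the function name, the dataset, and the quartile values 7 and 15 are illustrative assumptions.

```python
def iqr_outliers(values, q1, q3):
    """Return the values lying beyond 1.5 x IQR from the quartiles."""
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [x for x in values if x < lower_fence or x > upper_fence]

data = [3, 5, 7, 8, 9, 11, 13, 15, 18, 40]  # hypothetical data

# With Q1 = 7 and Q3 = 15, the IQR is 8, so the fences are -5 and 27.
print(iqr_outliers(data, q1=7, q3=15))  # [40]
```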

Rule 2 (The 2-Standard Deviation Rule):
Anything that is more than 2 standard deviations away from the mean.
Formula: a value \(x\) is an outlier if \(x > \bar{x} + 2s\) or \(x < \bar{x} - 2s\)
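Rule 2 can be sketched in the same way. This version uses Python's `statistics.stdev`, which divides by \(n-1\) to match the sample standard deviation \(s\); the function name and the dataset are made up for illustration.

```python
import statistics

def two_sd_outliers(values):
    """Return the values more than 2 sample standard deviations from the mean."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation (n - 1)
    lower_limit = mean - 2 * s
    upper_limit = mean + 2 * s
    return [x for x in values if x < lower_limit or x > upper_limit]

data = [12, 14, 15, 15, 16, 17, 18, 19, 20, 45]  # hypothetical data

print(two_sd_outliers(data))  # [45]
```

Here the mean is 19.1 and \(s\) is about 9.4, so anything above roughly 37.9 or below roughly 0.3 counts as an outlier.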

Cleaning Data

When you find an outlier, you don't just delete it! You must clean the data. This means investigating if the outlier is:
1. An error: (e.g., someone typed "150" instead of "15"). In this case, correct or remove it.
2. A genuine extreme value: (e.g., a billionaire in a survey about wealth). In this case, keep it, but use the Median instead of the Mean to describe the data so the results aren't distorted.
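To see why the Median is the safer choice when a genuine extreme value stays in, here is a sketch with made-up incomes (in thousands). All the numbers are hypothetical.

```python
import statistics

incomes = [20, 25, 30, 35, 40]   # hypothetical incomes, in thousands
with_extreme = incomes + [1000]  # add one genuine extreme value

mean_before = statistics.mean(incomes)
median_before = statistics.median(incomes)
mean_after = statistics.mean(with_extreme)
median_after = statistics.median(with_extreme)

print(mean_before, median_before)           # both are 30
print(round(mean_after, 1), median_after)   # 191.7 32.5
```

One extreme value drags the mean from 30 up to about 192, while the median barely moves.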

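Key Takeaway: Summary measures allow us to compare two groups easily. If Group A has a higher mean and a lower standard deviation than Group B, then Group A's values are higher on average and more consistent. Whether "higher" also means "better" depends on the context: a higher test score is good, but a higher race time is not.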
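A comparison like this might look as follows in Python. Both groups are invented for illustration.

```python
import statistics

group_a = [14, 15, 15, 16, 15]  # hypothetical scores
group_b = [5, 10, 12, 18, 20]   # hypothetical scores

mean_a, sd_a = statistics.mean(group_a), statistics.stdev(group_a)
mean_b, sd_b = statistics.mean(group_b), statistics.stdev(group_b)

print("Group A:", mean_a, round(sd_a, 2))
print("Group B:", mean_b, round(sd_b, 2))
```

Group A has the higher mean (15 against 13) and the much smaller standard deviation (about 0.71 against about 6.08), so its values are both higher on average and more consistent.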


Summary Checklist

• Do I know how to find the Mean, Median, and Mode?
• Can I use my calculator to find the Standard Deviation (\(s\))?
• Remember to divide by \(n-1\) for sample variance!
• Can I apply the \(1.5 \times \text{IQR}\) rule to find outliers?
• Do I understand that the Median/IQR are better for skewed data?

Don't worry if the standard deviation formula looks intimidating at first. Practice entering data into your calculator—it's your best friend for this chapter!