Welcome to Data Cleaning!

In your Statistics journey, you've learned how to find averages and measures of spread. But what happens when your data looks a bit... weird? Maybe you're measuring the heights of students and you find someone who is 12 feet tall, or a value is simply missing from your list.

In this chapter, we are going to learn how to spot these "odd ones out" (called outliers) and how to "scrub" our data to make sure our results are accurate and reliable. Don't worry if this seems a bit technical at first; it's mostly about following a few simple rules to keep your data honest!

1. What are Outliers?

An outlier is a data point that is significantly different from the rest of the dataset. Imagine you are recording the prices of cars in a local parking lot. Most are between £5,000 and £30,000, but then there is one gold-plated supercar worth £2,000,000. That supercar is an outlier.

Why do Outliers happen?

Outliers usually come from three places:
1. Errors: Someone typed an extra zero by mistake (e.g., writing 100 instead of 10).
2. Natural Variation: Sometimes the world just produces an extreme result (like an Olympic athlete’s running speed).
3. Sampling Issues: You accidentally measured something that wasn't supposed to be in your group.

Quick Takeaway: Outliers are the "rebels" of your data—values that don't fit the general pattern.

2. How to Spot Outliers (The Mathematical Rules)

In your OCR exam, you can't just look at a number and say "that looks too big." You need to prove it mathematically. There are two main ways to define an outlier in the H230 syllabus.

Method A: The Quartile Rule

This is the most common method used with box plots. It uses the Interquartile Range (IQR), which measures the spread of the middle 50% of your data: \( \text{IQR} = Q_3 - Q_1 \).

An outlier is any value that is:
• Smaller than \( Q_1 - (1.5 \times \text{IQR}) \)
• Larger than \( Q_3 + (1.5 \times \text{IQR}) \)

Step-by-Step Example:
Suppose you have: \( Q_1 = 20 \), \( Q_3 = 30 \).
1. Calculate IQR: \( 30 - 20 = 10 \).
2. Calculate the "1.5 multiplier": \( 1.5 \times 10 = 15 \).
3. Find the Lower Bound: \( 20 - 15 = 5 \).
4. Find the Upper Bound: \( 30 + 15 = 45 \).
Any value below 5 or above 45 is officially an outlier!
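
If you like to check your working on a computer, the four steps above can be sketched in a few lines of Python. This is only an illustration: the function names `iqr_fences` and `is_outlier_iqr` are made up for this example and are not part of any syllabus or library.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) outlier boundaries for the quartile rule."""
    iqr = q3 - q1                      # Step 1: interquartile range
    step = k * iqr                     # Step 2: the "1.5 multiplier"
    return q1 - step, q3 + step        # Steps 3 and 4: lower and upper bounds


def is_outlier_iqr(value, q1, q3):
    """True if the value lies strictly outside the boundaries."""
    lower, upper = iqr_fences(q1, q3)
    return value < lower or value > upper


lower, upper = iqr_fences(20, 30)
print(lower, upper)                   # 5.0 45.0
print(is_outlier_iqr(47, 20, 30))     # True: 47 is above 45
print(is_outlier_iqr(45, 20, 30))     # False: 45 sits exactly on the boundary
```

Notice that a value sitting exactly on a boundary is not counted as an outlier here, because the rule says "smaller than" and "larger than".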

Method B: The Standard Deviation Rule

This method is often used when the data follows a more "normal" or symmetric pattern.
An outlier is any value that is more than 2 standard deviations away from the mean.

The formula for the boundaries is: \( \text{mean} \pm (2 \times \sigma) \)

Quick Review:
• \( \sigma \) (sigma) = Standard Deviation
• \( \mu \) (mu) or \( \bar{x} \) = Mean
If the mean is 100 and the SD is 10, your "safe zone" is \( 100 \pm 20 \). So, anything below 80 or above 120 is an outlier.
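
The same "safe zone" check can be sketched in Python. Treat this as a rough illustration rather than an official method: the names `sd_bounds` and `is_outlier_sd` are invented for this example.

```python
def sd_bounds(mean, sd, k=2):
    """Return the (lower, upper) boundaries: mean minus and plus k standard deviations."""
    return mean - k * sd, mean + k * sd


def is_outlier_sd(value, mean, sd):
    """True if the value is more than 2 standard deviations from the mean."""
    lower, upper = sd_bounds(mean, sd)
    return value < lower or value > upper


print(sd_bounds(100, 10))             # (80, 120)
print(is_outlier_sd(125, 100, 10))    # True: 125 is above 120
print(is_outlier_sd(120, 100, 10))    # False: exactly 2 SDs away is not "more than" 2
```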

Key Takeaway: Always check which rule the question asks you to use. If they give you quartiles, use the 1.5 x IQR rule. If they give you the mean and SD, use the 2 x SD rule.

3. Cleaning Data

Once we've found our outliers or noticed our data is "messy," we need to clean it. This is like proofreading an essay before you hand it in.

What does "Cleaning" involve?

Cleaning data (also known as data scrubbing) involves dealing with three main headaches:
1. Outliers: Deciding whether to keep them or remove them. If it's a typing error, delete or fix it. If it's a real but extreme value, you might keep it but note its effect.
2. Missing Data: Sometimes a participant forgets to answer a question. You have to decide if you should ignore that person entirely or try to estimate the missing value.
3. Errors: Finding impossible values, like a "weight" recorded as "negative 5kg" or a "date of birth" in the year 2099.
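
To see what a first pass at cleaning might look like, here is a small Python sketch. Everything in it is assumed for illustration: the list of weights is made-up data, `None` stands for a missing answer, and the function name `sort_for_cleaning` is hypothetical. It only sorts the values into piles; deciding what to do with each pile is still up to you.

```python
# Made-up weights in kg: one answer is missing and one value is impossible.
raw_weights_kg = [62.5, None, 71.0, -5.0, 58.2]


def sort_for_cleaning(values):
    """Split values into usable data, positions of missing data, and impossible values."""
    usable, missing_positions, impossible = [], [], []
    for position, value in enumerate(values):
        if value is None:
            missing_positions.append(position)   # participant gave no answer
        elif value <= 0:
            impossible.append(value)             # a weight cannot be zero or negative
        else:
            usable.append(value)
    return usable, missing_positions, impossible


usable, missing_positions, impossible = sort_for_cleaning(raw_weights_kg)
print(usable)              # [62.5, 71.0, 58.2]
print(missing_positions)   # [1]
print(impossible)          # [-5.0]
```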

Did you know? It is often said that, in real-world data science, "cleaning" can take up as much as 80% of a statistician's time!

4. Critiquing Data Presentation

As an AS Level student, you are expected to look at how data is presented (like in a histogram or a scatter diagram) and judge if it's done well. This is called critiquing.

Common things to look for:

• Does the outlier ruin the scale? If you have one massive outlier, the rest of the data might look squashed into a tiny corner of the graph.
• Is the diagram misleading? Does the Y-axis start at zero? If not, the differences between bars might look much bigger than they actually are.
• Choice of Average: If there are massive outliers, the mean gets "pulled" toward them. In these cases, the median is usually a better measure of the "average."
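
To see how strongly one outlier can pull the mean, here is a short Python sketch using the built-in `statistics` module. The car prices are made up for illustration, loosely based on the car park example from earlier.

```python
from statistics import mean, median

# Made-up car prices in pounds; the last one is the supercar.
prices = [5_000, 12_000, 18_000, 30_000, 2_000_000]
prices_without_supercar = prices[:-1]

# Compare both averages with and without the outlier.
print(mean(prices), median(prices))
print(mean(prices_without_supercar), median(prices_without_supercar))
```

With the supercar included, the mean is £413,000 but the median is only £18,000. Without it, the mean drops to £16,250 while the median barely moves, to £15,000. That is the "pull" on the mean in action.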

Common Mistake to Avoid: Don't just say a graph is "bad." Use statistical language. Say: "The presence of a significant outlier has skewed the mean, making it an unrepresentative measure of central tendency."

Summary: The Quick Review Box

1. Outlier (Quartile Rule): \( < Q_1 - 1.5\text{IQR} \) or \( > Q_3 + 1.5\text{IQR} \).
2. Outlier (SD Rule): More than 2 standard deviations from the mean.
3. Cleaning: Removing errors, fixing typos, and deciding what to do with missing values.
4. Choosing Averages: Use the median if there are extreme outliers, as it is "resistant" to them!

Keep practicing these calculations! Once you get the hang of the "1.5 x IQR" steps, you'll be able to spot outliers in your sleep. You've got this!