Welcome to the World of Data!

Ever wondered how companies decide which products to sell, or how scientists prove a new medicine works? It all starts with Single Variable Data. In this chapter, we aren’t just looking at numbers; we are learning how to tell the "story" behind those numbers. Whether you're a maths whiz or find numbers a bit intimidating, don't worry! We’ll break everything down into bite-sized pieces.

1. Visualising Data: The Power of Diagrams

Before we do any calculations, we need to see what the data looks like. The OCR syllabus expects you to interpret several types of charts. Each has its own "superpower."

Key Diagrams You Need to Know:

1. Vertical Line Charts & Dot Plots: Great for small datasets where you want to see every single individual point.
2. Bar Charts: Perfect for categorical data (like eye colour or favourite pizza toppings).
3. Stem-and-Leaf Diagrams: These are unique because they show the shape of the data but keep the original values visible. Remember to always look for the Key!
4. Box-and-Whisker Plots: These show the "five-number summary" (Minimum, Lower Quartile, Median, Upper Quartile, and Maximum). They are the best tool for comparing two different sets of data side-by-side.
5. Cumulative Frequency Diagrams: Used to estimate the median and quartiles for grouped data.
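
Reading a median or quartile off a cumulative frequency diagram amounts to interpolating between plotted points. Here is a minimal sketch of that idea, assuming the points are joined with straight lines. The class boundaries, cumulative frequencies, and the function name `estimate_from_cumulative` are all made up for illustration.

```python
def estimate_from_cumulative(upper_bounds, cum_freqs, lowest_bound, target):
    """Estimate the data value at cumulative frequency `target`
    by linear interpolation between the plotted points."""
    prev_x, prev_cf = lowest_bound, 0
    for x, cf in zip(upper_bounds, cum_freqs):
        if target <= cf and cf > prev_cf:
            fraction = (target - prev_cf) / (cf - prev_cf)
            return prev_x + fraction * (x - prev_x)
        prev_x, prev_cf = x, cf
    raise ValueError("target exceeds the total frequency")

# Hypothetical grouped data: classes 0-10, 10-20, 20-30, 30-40, 40-50
upper_bounds = [10, 20, 30, 40, 50]
cum_freqs = [4, 14, 30, 38, 40]
n = cum_freqs[-1]

q1 = estimate_from_cumulative(upper_bounds, cum_freqs, 0, n / 4)
med = estimate_from_cumulative(upper_bounds, cum_freqs, 0, n / 2)
q3 = estimate_from_cumulative(upper_bounds, cum_freqs, 0, 3 * n / 4)

print(q1, med, q3)  # 16.0 23.75 30.0
```

On paper you do the same thing by drawing a line across from \(n/4\), \(n/2\) and \(3n/4\) on the vertical axis and reading down to the horizontal axis.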

The Histogram (The Big Boss of Diagrams)

Histograms look like bar charts, but they are different! In a histogram, the Area of each bar represents the Frequency. The height of the bar is the Frequency Density, not the frequency itself.

The Golden Rule: \( \text{Frequency} = \text{Class Width} \times \text{Frequency Density} \)

Analogy: Imagine bars are like different sized containers. To know how much "water" (frequency) is inside, you need to know both how wide the container is and how high the water level (density) is.
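
To see the Golden Rule in action, here is a short sketch that works out the frequency density for each class and then checks that width × density gives back the frequency. The classes and frequencies are invented for illustration.

```python
# Hypothetical grouped data: (lower bound, upper bound, frequency)
classes = [(0, 10, 5), (10, 20, 12), (20, 40, 16), (40, 70, 6)]

def frequency_density(lower, upper, frequency):
    """Height of the histogram bar: frequency divided by class width."""
    return frequency / (upper - lower)

densities = [frequency_density(lo, hi, f) for lo, hi, f in classes]

# Area of each bar = class width * frequency density (should match the frequency)
areas = [(hi - lo) * d for (lo, hi, _), d in zip(classes, densities)]

for (lo, hi, f), d in zip(classes, densities):
    print(f"{lo}-{hi}: frequency {f}, frequency density {d:.2f}")
```

Notice that the 20-40 class has the largest frequency (16) but not the tallest bar, because its width is double that of the 10-20 class.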

Quick Review: Which diagram should I use?
• To keep original values: Stem-and-Leaf.
• To compare spreads: Box Plot.
• For grouped continuous data: Histogram.

2. Measures of Central Tendency (The "Middle")

This is about finding the "typical" value in your data.

Mean (\(\bar{x}\)): The arithmetic average. \( \bar{x} = \frac{\sum x}{n} \).
Median: The middle value when data is in order. It’s "resistant" to outliers (it doesn't care if one value is a million miles away).
Mode: The most common value.

Memory Aid:
• MOde is the MOst frequent.
• MEdian is in the MIddle (like the median strip on a road).
• Mean is the "mean" one because it makes you do the most calculation!
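
The three averages can be checked quickly with Python's built-in `statistics` module. This is just an illustrative sketch with made-up scores; it also shows how one extreme value drags the mean but barely moves the median.

```python
from statistics import mean, median, mode

scores = [4, 7, 7, 8, 9, 10, 60]  # hypothetical data; 60 is an extreme value

print(mean(scores))    # 15
print(median(scores))  # 8
print(mode(scores))    # 7

# Drop the extreme value and compare
trimmed = scores[:-1]
print(mean(trimmed))    # 7.5
print(median(trimmed))  # 7.5
```

Removing the 60 halves the mean (15 to 7.5) while the median only shifts from 8 to 7.5, which is what "resistant to outliers" means.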

3. Measures of Spread (Variation)

Knowing the average isn't enough. We need to know if the data is all bunched together or spread out like a mess!

Quartiles and Inter-Quartile Range (IQR)

Quartiles split your ordered data into four equal quarters.
Lower Quartile (\(Q_1\)): 25% of the way through.
Upper Quartile (\(Q_3\)): 75% of the way through.
IQR (\(Q_3 - Q_1\)): This tells you how spread out the middle 50% of the data is. It ignores the extreme "weird" values at the ends.
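
Here is a small sketch of finding the quartiles and IQR from a list. It assumes the "median of each half" convention (leaving out the middle value when there is an odd number of values). Calculators and software may use slightly different conventions, so small differences in \(Q_1\) and \(Q_3\) are normal. The data and the function name `quartiles` are illustrative only.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, median, Q3) using the 'median of each half' convention.
    The middle value is excluded from both halves when n is odd."""
    ordered = sorted(data)
    half = len(ordered) // 2
    lower_half = ordered[:half]
    upper_half = ordered[-half:]
    return median(lower_half), median(ordered), median(upper_half)

data = [3, 5, 7, 8, 9, 11, 13, 15, 20]  # hypothetical data, n = 9
q1, q2, q3 = quartiles(data)
iqr = q3 - q1

print(q1, q2, q3, iqr)  # 6.0 9 14.0 8.0
```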

Variance and Standard Deviation

These are more sophisticated. They look at every single data point to see how far, on average, they are from the mean.

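Variance (\(\sigma^2\)): The mean of the squared distances from the mean.
Standard Deviation (\(\sigma\)): The square root of the variance, also called the "Root Mean Square Deviation." Think of it as a typical distance from the mean (not exactly the average distance, because the distances are squared before averaging). A low standard deviation means the data is consistent; a high one means it's all over the place.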

The Formulas (Don't panic!):

From a list of data:
\( \sigma = \sqrt{\frac{\sum (x - \bar{x})^2}{n}} \) or the easier version: \( \sigma = \sqrt{\frac{\sum x^2}{n} - \bar{x}^2} \)
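
To check that the two formulas really do agree, here is a short sketch with made-up data. It also compares the result with `statistics.pstdev`, which divides by \(n\) like the formulas above.

```python
from math import sqrt
from statistics import pstdev

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data
n = len(data)
x_bar = sum(data) / n  # mean = 5.0

# Definition: root of the mean squared deviation from the mean
sigma_definition = sqrt(sum((x - x_bar) ** 2 for x in data) / n)

# Shortcut: "mean of the squares minus the square of the mean"
sigma_shortcut = sqrt(sum(x ** 2 for x in data) / n - x_bar ** 2)

print(sigma_definition, sigma_shortcut, pstdev(data))  # 2.0 2.0 2.0
```

Here \(\sum x^2 = 232\), so the shortcut gives \(\sqrt{232/8 - 5^2} = \sqrt{29 - 25} = 2\), and the variance is \(\sigma^2 = 4\).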

Pro-Tip: Use your calculator's Statistics Mode! OCR expects you to know how to use your calculator to find these values quickly.

Key Takeaway: Use Mean and Standard Deviation together when the data is fairly symmetrical with no outliers. Use Median and IQR if your data is skewed or has crazy outliers that would "pull" the mean too far.

4. Outliers and Cleaning Data

Sometimes data contains mistakes or very strange values called outliers. You can't just ignore them because you feel like it; you need a mathematical rule!

How to spot an Outlier (OCR Definitions):

A value is usually an outlier if:
1. It is more than 1.5 × IQR beyond the nearer quartile.
(i.e., higher than \(Q_3 + 1.5 \times \text{IQR}\) or lower than \(Q_1 - 1.5 \times \text{IQR}\))
2. It is more than 2 standard deviations away from the mean.
(i.e., higher than \(\bar{x} + 2\sigma\) or lower than \(\bar{x} - 2\sigma\))
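
Both rules can be sketched in a few lines. The data below is invented, the function names are illustrative, and the quartiles use the "median of each half" convention (an assumption; other conventions can move the boundaries slightly).

```python
from statistics import mean, median, pstdev

def iqr_rule_outliers(data):
    """Values more than 1.5 * IQR below Q1 or above Q3 (needs at least 2 values)."""
    ordered = sorted(data)
    half = len(ordered) // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[-half:])
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in ordered if x < low or x > high]

def two_sigma_outliers(data):
    """Values more than 2 standard deviations away from the mean."""
    x_bar = mean(data)
    sigma = pstdev(data)  # divides by n, matching the formula for sigma
    return [x for x in data if abs(x - x_bar) > 2 * sigma]

data = [12, 14, 15, 15, 16, 17, 18, 19, 20, 45]  # hypothetical data

print(iqr_rule_outliers(data))   # [45]
print(two_sigma_outliers(data))  # [45]
```

For this data \(Q_1 = 15\) and \(Q_3 = 19\), so the IQR boundaries are 9 and 25. The two rules happen to agree here, but they will not always flag the same values.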

Cleaning the Data

Cleaning data involves dealing with missing values, errors, or outliers. If a value is a clear mistake (like someone's age being recorded as 200), we remove it. This is vital because "Garbage In = Garbage Out!"
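
As a small illustration of cleaning, the sketch below drops missing values and impossible ages from a made-up list. The 0 to 120 range and the function name `clean_ages` are assumptions chosen for this example, not an official rule.

```python
raw_ages = [34, 27, None, 200, 41, -3, 38]  # hypothetical survey data

def clean_ages(values, lowest=0, highest=120):
    """Keep only recorded ages inside a plausible range (assumed 0 to 120)."""
    return [v for v in values if v is not None and lowest <= v <= highest]

ages = clean_ages(raw_ages)
print(ages)  # [34, 27, 41, 38]
```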

Did you know? It is often claimed that data cleaning takes up to 80% of a real data scientist's time! Calculating the final answer is actually the short part.

5. Comparing Distributions

In the exam, you will often be asked to "Compare these two sets of data." To get full marks, you must comment on two things, using values from the data:

1. A measure of central tendency: Compare the Medians or Means. (e.g., "Group A had a higher median score than Group B, suggesting they performed better on average.")
2. A measure of spread: Compare the IQRs or Standard Deviations. (e.g., "Group B has a smaller IQR, meaning their results were more consistent than Group A's.")

Common Mistake to Avoid: Never just list the numbers. You must interpret them in the context of the question (e.g., talk about "test scores" or "plant heights," not just "the data").
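
To show what "an average, a spread, and context" looks like in practice, here is a rough sketch that builds the two comparison sentences from two made-up groups. The function name `compare`, the data, and the wording of the sentences are all illustrative, and ties are not handled.

```python
from statistics import mean, pstdev

def compare(label_a, a, label_b, b, context):
    """Return two sentences: one comparing averages, one comparing spread."""
    mean_a, mean_b = mean(a), mean(b)
    sd_a, sd_b = pstdev(a), pstdev(b)
    higher, lower = (label_a, label_b) if mean_a > mean_b else (label_b, label_a)
    tighter, looser = (label_a, label_b) if sd_a < sd_b else (label_b, label_a)
    return [
        f"{higher} has a higher mean for {context} than {lower} "
        f"({max(mean_a, mean_b):.1f} vs {min(mean_a, mean_b):.1f}), "
        f"so its {context} are higher on average.",
        f"{tighter} has a smaller standard deviation than {looser} "
        f"({min(sd_a, sd_b):.1f} vs {max(sd_a, sd_b):.1f}), "
        f"so its {context} are more consistent.",
    ]

group_a = [52, 61, 58, 70, 64]  # hypothetical test scores
group_b = [60, 59, 62, 61, 58]

sentences = compare("Group A", group_a, "Group B", group_b, "test scores")
for sentence in sentences:
    print(sentence)
```

Each sentence names a statistic, quotes the values, and says what that means for the test scores, which is the pattern an exam answer should follow.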

Summary: Key Takeaways

Histograms: Area = Frequency. Check your frequency density!
Standard Deviation: Tells you about consistency. Use your calculator.
Outliers: Use the \(1.5 \times \text{IQR}\) rule or \(2\sigma\) rule to prove something is an outlier.
Comparison: Always talk about Average AND Spread in context.

Don't worry if this seems like a lot of formulas at first. The more you practice "reading" the diagrams, the more natural it becomes. You've got this!