Introduction: Welcome to the World of Data!

Ever wondered how companies predict what you’ll buy, or how scientists make sense of thousands of medical results? It all starts with Data Presentation and Interpretation. In this chapter, we’re going to learn how to take a messy pile of numbers and turn them into clear, meaningful pictures and summaries. Don't worry if Statistics feels a bit "different" from Pure Maths—think of it as using numbers to tell a story about the real world!

1. Visualising Data: The Big Picture

Before we calculate anything, we usually want to "see" the data. Different diagrams tell us different things.

Histograms

Unlike the bar charts you used in school, a histogram uses the area of each bar, not its height, to represent the frequency. This is vital when the "class widths" (the sizes of the groups along the bottom) are unequal.

Key Formula:
\( \text{Frequency Density} = \frac{\text{Frequency}}{\text{Class Width}} \)

Analogy: Think of Frequency Density like "crowdedness." If you have 10 people in a tiny room, it feels very crowded (high density). If you have 10 people in a football stadium, it feels empty (low density). The "area" is the total number of people.
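To see the formula in action, here is a minimal Python sketch. The class boundaries and frequencies are made up for illustration, and the function name `frequency_density` is our own choice. It works out the height of each bar and then checks that width times height gives back the frequency.

```python
# Hypothetical grouped data: (lower bound, upper bound, frequency) for each class.
classes = [(0, 10, 5), (10, 20, 12), (20, 40, 16), (40, 80, 8)]


def frequency_density(lower, upper, frequency):
    """Frequency density = frequency / class width."""
    return frequency / (upper - lower)


# Height of each histogram bar.
densities = [frequency_density(lo, hi, f) for lo, hi, f in classes]

# Area of each bar (class width * frequency density) recovers the frequency.
areas = [(hi - lo) * d for (lo, hi, _), d in zip(classes, densities)]

for (lo, hi, f), d, area in zip(classes, densities, areas):
    print(f"{lo}-{hi}: frequency {f}, density {d:.2f}, bar area {area:.1f}")
```

Notice that the 20-40 class has the largest frequency (16) but not the tallest bar, because its 16 values are spread over a wider class.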

Box and Whisker Plots

These are fantastic for seeing the "spread" of data. They show the minimum, lower quartile (Q1), median (Q2), upper quartile (Q3), and maximum.

Quick Review:
- Median: The middle value when the data is written in order.
- Interquartile Range (IQR): \( Q3 - Q1 \). This tells you how spread out the middle 50% of the data is.
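As a rough illustration, the quartiles and IQR can be found with Python's built-in `statistics` module. The data set is made up, and the `"inclusive"` method is just one of several quartile conventions. Your textbook or calculator may use a slightly different rule and give slightly different quartiles.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical data set, already in order

# n=4 splits the data into quarters and returns the three cut points Q1, Q2, Q3.
# method="inclusive" is an assumption; other quartile conventions exist.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}, IQR = {iqr}")
```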

Cumulative Frequency

This is a "running total" graph, so it never goes down! (It stays level only where a class has a frequency of zero.) We use it to estimate the median, quartiles and percentiles for grouped data.
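Here is a small sketch of how the running total is built, and how a median could be estimated from it. The grouped data and the helper name `estimate_value` are invented for illustration. The sketch also assumes values are spread evenly within each class (linear interpolation), which matches reading the value off a graph drawn with straight line segments.

```python
from itertools import accumulate

# Hypothetical grouped data: the classes are 0-10, 10-20, 20-30 and 30-40.
lower_bound = 0
upper_bounds = [10, 20, 30, 40]
frequencies = [4, 10, 12, 4]

# Running total, plotted against the upper class boundaries.
cumulative = list(accumulate(frequencies))


def estimate_value(target):
    """Estimate the data value at cumulative frequency `target`
    by linear interpolation within the class that contains it."""
    prev_bound, prev_cum = lower_bound, 0
    for bound, cum in zip(upper_bounds, cumulative):
        if target <= cum:
            fraction = (target - prev_cum) / (cum - prev_cum)
            return prev_bound + fraction * (bound - prev_bound)
        prev_bound, prev_cum = bound, cum
    raise ValueError("target exceeds the total frequency")


total = cumulative[-1]
median_estimate = estimate_value(total / 2)

print("Cumulative frequencies:", cumulative)
print(f"Estimated median: {median_estimate:.2f}")
```

With these numbers, half of the total frequency is 15, which falls in the 20-30 class, so the estimated median lands a little above 20.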

Summary: Diagrams help us spot patterns and outliers quickly. Always check the labels on your axes!

2. Measures of Central Tendency (The "Averages")

We use these to find a "typical" value in our data set.

  • Mean (\( \bar{x} \)): The sum of all values divided by the number of values. It uses every piece of data but can be "dragged" away by extreme values.
  • Median: The middle value when the data is in order. It is very resistant to outliers.
  • Mode: The most common value. Great for non-numerical data (like favorite colors).

Common Mistake: Using the mean for data that has a few massive outliers (like billionaire salaries in a normal office). In that case, the median is a much better "typical" value!
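A quick experiment shows how much one extreme value can drag the mean. The salaries below are invented for illustration.

```python
import statistics

# Hypothetical office salaries.
salaries = [28_000, 30_000, 31_000, 33_000, 35_000]

# The same office after one extremely well-paid person joins.
with_outlier = salaries + [5_000_000]

mean_before = statistics.mean(salaries)
median_before = statistics.median(salaries)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print(f"Before: mean = {mean_before}, median = {median_before}")
print(f"After:  mean = {mean_after}, median = {median_after}")
```

The mean jumps from 31,400 to 859,500, a salary nobody in the office actually earns, while the median only moves from 31,000 to 32,000.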

3. Measures of Variation (The "Spread")

Knowing the average isn't enough. We need to know if the data is all bunched together or spread far apart.

Standard Deviation

This is the most common measure of spread at A Level. It tells us roughly how far a typical data point is from the mean.

The "Short-cut" Formula for \( S_{xx} \):
\( S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} \)

Standard Deviation (\( \sigma \)):
\( \sigma = \sqrt{\frac{S_{xx}}{n}} \)

Memory Aid: Dividing \( S_{xx} \) by \( n \) gives the variance, \( \sigma^2 = \frac{\sum x^2}{n} - \bar{x}^2 \). To remember the order, think "Mean of the squares minus square of the mean."
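The formulas above can be checked on a small, made-up data set. This sketch computes \( S_{xx} \) and \( \sigma \) step by step, then compares the answer with `statistics.pstdev`, Python's built-in population standard deviation (the version that divides by \( n \)).

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data set

n = len(data)
sum_x = sum(data)                    # sum of the values
sum_x2 = sum(x * x for x in data)    # sum of the squared values

# Short-cut formula: S_xx = sum(x^2) - (sum(x))^2 / n
s_xx = sum_x2 - sum_x ** 2 / n

# Standard deviation: sigma = sqrt(S_xx / n)
sigma = math.sqrt(s_xx / n)

# Memory aid: variance = mean of the squares - square of the mean
variance = sum_x2 / n - (sum_x / n) ** 2

print(f"S_xx = {s_xx}, sigma = {sigma}, variance = {variance}")
print(f"Library check: {statistics.pstdev(data)}")
```

Here \( \sum x = 40 \) and \( \sum x^2 = 232 \), so \( S_{xx} = 232 - 200 = 32 \) and \( \sigma = \sqrt{32/8} = 2 \).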

Coding

Sometimes data is huge (e.g., 100,005, 100,010). We can "code" it to make it smaller (e.g., subtract 100,000).
If we use the code \( y = \frac{x - a}{b} \):
1. The Mean follows the code: \( \bar{y} = \frac{\bar{x} - a}{b} \).
2. The Standard Deviation is ONLY affected by multiplying/dividing: \( \sigma_y = \frac{\sigma_x}{b} \) (for \( b > 0 \)). (Adding or subtracting doesn't change how "spread out" things are!)
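Both rules can be tested numerically. The sketch below uses made-up values with \( a = 100{,}000 \) and \( b = 5 \) (our own choice of code), then compares the statistics of the coded data with those of the original data.

```python
import math
import statistics

x = [100_005, 100_010, 100_015, 100_020, 100_025]  # hypothetical raw data
a, b = 100_000, 5                                  # assumed coding constants

# Apply the code y = (x - a) / b.
y = [(xi - a) / b for xi in x]

mean_x, mean_y = statistics.mean(x), statistics.mean(y)
sd_x, sd_y = statistics.pstdev(x), statistics.pstdev(y)

# Rule 1: the mean follows the whole code.
print(mean_y, (mean_x - a) / b)

# Rule 2: the standard deviation is only divided by b.
print(sd_y, sd_x / b)

# Decoding: recover the original statistics from the coded ones.
print(a + b * mean_y, b * sd_y)
```

The coded values are simply 1, 2, 3, 4, 5, which are much easier to work with by hand.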

Summary: Standard deviation measures spread. Coding makes calculations easier without losing the underlying pattern.

4. Correlation and Regression

This is for Bivariate Data (data with two variables, like height and weight).

Scatter Diagrams

  • Explanatory (Independent) Variable: Usually on the \( x \)-axis. This is the one we think "explains" the change.
  • Response (Dependent) Variable: Usually on the \( y \)-axis. This is the one we are measuring.

Correlation vs. Causation

Did you know? Shark attacks and ice cream sales are highly correlated. Does eating ice cream cause shark attacks? No! Both increase in summer, when more people are at the beach. This is why we say "Correlation does not imply causation."

Regression Lines

A regression line (line of best fit) allows us to make predictions.
- Interpolation: Predicting inside the range of data we have. This is usually reliable.
- Extrapolation: Predicting outside the range of data. Warning! This is unreliable, because the pattern may not continue beyond the data we collected.
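To make this concrete, here is a sketch that fits a regression line \( y = a + bx \) and labels each prediction as interpolation or extrapolation. The data is invented and the helper name `predict` is our own. The sketch assumes the standard least-squares formulas \( b = \frac{S_{xy}}{S_{xx}} \) and \( a = \bar{y} - b\bar{x} \), which are not stated in these notes; in an exam the equation of the line is often given to you or found on a calculator.

```python
# Hypothetical bivariate data.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares gradient and intercept (assumed standard formulas).
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = s_xy / s_xx
a = y_bar - b * x_bar


def predict(x_new):
    """Return (predicted y, kind), where kind records whether
    x_new lies inside or outside the range of the observed x values."""
    inside = min(x) <= x_new <= max(x)
    kind = "interpolation" if inside else "extrapolation"
    return a + b * x_new, kind


print(f"Regression line: y = {a:.2f} + {b:.2f}x")
for x_new in (3.5, 20):
    value, kind = predict(x_new)
    print(f"x = {x_new}: predicted y = {value:.2f} ({kind})")
```

The prediction at \( x = 3.5 \) sits inside the data range, so it is reasonably trustworthy. The prediction at \( x = 20 \) is far outside it, so treat that number with suspicion.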

Summary: Regression lines are for predicting, but don't trust them too far outside your data range!

5. Cleaning Data and Identifying Outliers

Real-world data is messy. It has errors, missing values, and weird anomalies called outliers.

Finding Outliers

You will usually be given a rule to find outliers. Common ones are:
1. Values more than \( 1.5 \times IQR \) above \( Q3 \) or below \( Q1 \).
2. Values more than \( 3 \times \text{standard deviation} \) away from the mean.
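Here is a sketch that applies both rules to the same made-up data set. The quartiles use the `"inclusive"` method from Python's `statistics` module, which is an assumption: in an exam, use the quartiles and the rule the question gives you.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 30]  # hypothetical data with one suspicious value

# Rule 1: more than 1.5 * IQR above Q3 or below Q1.
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
iqr_outliers = [v for v in data if v < lower_fence or v > upper_fence]

# Rule 2: more than 3 standard deviations away from the mean.
mean = statistics.mean(data)
sigma = statistics.pstdev(data)
sd_outliers = [v for v in data if abs(v - mean) > 3 * sigma]

print(f"IQR rule fences: {lower_fence} to {upper_fence}, outliers: {iqr_outliers}")
print(f"3-sigma rule: mean = {mean:.2f}, sigma = {sigma:.2f}, outliers: {sd_outliers}")
```

With this data, the IQR rule flags 30 as an outlier but the 3 standard deviation rule does not, because the extreme value inflates the standard deviation itself. This is why you must use the rule stated in the question.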

Data Cleaning

If you find an error (like a person's height listed as 20 meters), you should remove it. This is called cleaning the data. If it's a genuine but weird value, you might keep it but note it as an outlier.

Takeaway: Always look at your data before you calculate. If a number looks "impossible," it probably is!

Final Tips for Success

1. Read the context: If the question is about "Daily Mean Temperature," your answer shouldn't be 500 degrees!
2. Calculator Skills: Learn how to enter data into the "Statistics" mode of your calculator. It can calculate the mean and standard deviation for you in seconds!
3. Don't Panic: If a formula looks scary, break it down step-by-step. Most of these marks are for following the process correctly.