Welcome to the World of Data Analysis!

In this chapter, we are moving beyond just collecting numbers. We are learning how to process, visualize, and understand what those numbers are actually telling us. Think of a statistician like a detective: the data is the evidence, and the tools in this chapter are how you solve the case!

Since this is the Higher Tier content, we will look at some advanced techniques that help us compare different sets of data and make very accurate predictions. Don't worry if some of the formulas look big at first—we will break them down step-by-step.

1. Representing Data: Beyond Simple Charts

You already know bar charts and pictograms, but for Higher Tier, we need to compare datasets and look at the "shape" of the data.

Comparative Pie Charts

When comparing two groups of different sizes (e.g., a small school vs. a large school) using pie charts, we don't draw both circles the same size. Instead, we make the area of each circle proportional to its total frequency.

The Trick: To find the radius of a new pie chart, use this relationship:
\( \frac{\text{Area}_1}{\text{Area}_2} = \frac{\text{Total Frequency}_1}{\text{Total Frequency}_2} \)

Because Area is related to the radius squared (\( r^2 \)), the ratio of the radii is the square root of the ratio of the frequencies.
Example: If Chart B has 4 times the data of Chart A, its radius should be twice as big (\( \sqrt{4} = 2 \)).
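
To see the square-root rule in action, here is a minimal Python sketch. The function name `scaled_radius` and the school sizes are made up for illustration; only the area-to-frequency relationship comes from the notes above.

```python
import math

def scaled_radius(radius_a, total_a, total_b):
    """Radius for chart B so that its area is proportional to its total frequency.

    Areas are in the ratio total_b : total_a, so radii are in the ratio
    sqrt(total_b / total_a).
    """
    return radius_a * math.sqrt(total_b / total_a)

# Hypothetical example: chart A has radius 3 cm for 200 pupils,
# chart B represents 800 pupils (4 times the data).
print(scaled_radius(3, 200, 800))  # 6.0
```

Four times the data gives double the radius, matching the example above.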

Histograms (Unequal Class Widths)

In a histogram, the area of the bar represents the frequency, not the height. This is vital when your groups (class widths) aren't the same size.

Key Formula:
\( \text{Frequency Density} = \frac{\text{Frequency}}{\text{Class Width}} \)

Quick Review: When the class widths are unequal, always label the y-axis "Frequency Density". Plotting plain frequency only works when every class has the same width!
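
The frequency density calculation can be sketched in a few lines of Python. The grouped data below is invented for illustration, and `frequency_densities` is just a helper name chosen here.

```python
def frequency_densities(classes):
    """Return the frequency density for each class.

    classes: list of (lower_bound, upper_bound, frequency) tuples.
    """
    return [freq / (upper - lower) for lower, upper, freq in classes]

# Hypothetical grouped data with unequal class widths (10, 5 and 15)
classes = [(0, 10, 20), (10, 15, 25), (15, 30, 30)]
print(frequency_densities(classes))  # [2.0, 5.0, 2.0]
```

Notice that the middle class has the tallest bar even though it does not have the highest frequency, because it is the narrowest class.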

Understanding Skewness

Skewness tells us if our data is "leaning" to one side.
Positive Skew: Most data is at the low end (a long tail to the right).
Negative Skew: Most data is at the high end (a long tail to the left).

You can calculate skewness using this formula (provided on your exam sheet):
\( \text{Skew} = \frac{3(\text{mean} - \text{median})}{\text{standard deviation}} \)

Key Takeaway: If the mean > median, the data usually has a positive skew. If the median > mean, it usually has a negative skew.
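
As a rough check of the skew formula, here is a small Python sketch. It assumes the population standard deviation (`statistics.pstdev`); your course may specify a particular version, so treat that choice as an assumption. The datasets are made up.

```python
import statistics

def skew(data):
    """Skew = 3 * (mean - median) / standard deviation."""
    mean = statistics.mean(data)
    median = statistics.median(data)
    sd = statistics.pstdev(data)  # assumption: population standard deviation
    return 3 * (mean - median) / sd

# Hypothetical datasets
long_right_tail = [1, 2, 2, 3, 12]    # mean 4 > median 2
long_left_tail = [1, 10, 11, 11, 12]  # mean 9 < median 11
symmetric = [1, 2, 3, 4, 5]           # mean 3 = median 3

print(skew(long_right_tail) > 0)  # True (positive skew)
print(skew(long_left_tail) < 0)   # True (negative skew)
print(skew(symmetric) == 0)       # True (no skew)
```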

2. Advanced Averages (Central Tendency)

We usually talk about the "Big Three" (mean, median, mode), but Higher Tier students need a few more tools.

Weighted Mean

Used when some numbers are more important than others.
Analogy: Your final grade might be 20% homework and 80% exam. The exam "weighs" more!
\( \text{Weighted Mean} = \frac{\sum (\text{value} \times \text{weight})}{\sum \text{weights}} \)
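
Here is the homework-and-exam analogy as a short Python sketch. The marks (70 and 60) are hypothetical; the weights are the 20% and 80% from the analogy.

```python
def weighted_mean(values, weights):
    """Sum of (value x weight) divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical marks: 70 for homework (weight 20), 60 for the exam (weight 80)
print(weighted_mean([70, 60], [20, 80]))  # 62.0
```

The result sits much closer to the exam mark than the homework mark, because the exam carries four times the weight.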

Geometric Mean

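This is used for averaging growth rates or percentage changes. For example, to find the average yearly growth from five years of interest rates, first turn each rate into a multiplier (5% becomes 1.05), take the geometric mean of the multipliers, then subtract 1 to get back to a rate.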
\( \text{Geometric Mean} = \sqrt[n]{x_1 \times x_2 \times \cdots \times x_n} \)
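
A minimal Python sketch of the geometric mean follows. The growth figures are invented, and converting each percentage into a multiplier (for example, +10% becomes 1.10) is the usual approach assumed here.

```python
import math

def geometric_mean(values):
    """n-th root of the product of n positive values."""
    return math.prod(values) ** (1 / len(values))

# Simple check: the geometric mean of 2 and 8 is the square root of 16
print(round(geometric_mean([2, 8]), 6))  # 4.0

# Hypothetical yearly changes of +10%, +20% and -5%, written as multipliers
multipliers = [1.10, 1.20, 0.95]
average_multiplier = geometric_mean(multipliers)
average_rate_percent = (average_multiplier - 1) * 100
print(average_rate_percent)  # average percentage growth per year
```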

Which average to use?

  • Mean: Best for symmetrical data with no outliers.
  • Median: Best if the data is skewed or has outliers (it's "resistant" to extreme values).
  • Mode: Best for non-numerical (qualitative) data like "favorite color."

Key Takeaway: The mean is the most sensitive to extreme values. If one person in a room is a billionaire, the mean wealth shoots up, but the median barely moves!
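
The billionaire effect is easy to try out. The wealth figures below are invented for illustration; the sketch just compares the mean and median before and after one extreme value joins the room.

```python
import statistics

def summarise(data):
    """Return (mean, median) for a list of numbers."""
    return statistics.mean(data), statistics.median(data)

# Hypothetical wealth of five people in a room
room = [20_000, 30_000, 40_000, 50_000, 60_000]
room_with_billionaire = room + [1_000_000_000]

for label, data in [("before", room), ("after", room_with_billionaire)]:
    mean, median = summarise(data)
    print(f"{label}: mean={mean:,.0f} median={median:,.0f}")
# before: mean=40,000 median=40,000
# after: mean=166,700,000 median=45,000
```

The mean jumps by a factor of thousands, while the median only shifts to the next pair of middle values.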

3. Measuring Spread (Dispersion)

Knowing the average isn't enough. We need to know if the data is all bunched together or spread out.

Standard Deviation (\( \sigma \))

This is the "gold standard" of spread. It tells us roughly how far, on average, the data points lie from the mean.

Don't panic: You will be given the formula! Just remember:
1. Large \( \sigma \) = Data is very spread out.
2. Small \( \sigma \) = Data is consistent and close to the mean.
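
To make the idea concrete, here is a sketch of the population standard deviation written out step by step. The version on your formula sheet may be arranged differently, so treat this as one common form; the datasets are made up.

```python
import math

def population_sd(data):
    """Square root of the mean squared distance from the mean."""
    mean = sum(data) / len(data)
    variance = sum((x - mean) ** 2 for x in data) / len(data)
    return math.sqrt(variance)

# Hypothetical datasets that share the same mean (5)
spread_out = [2, 4, 4, 4, 5, 5, 7, 9]
consistent = [5, 5, 5, 5, 5, 5, 5, 5]

print(population_sd(spread_out))  # 2.0
print(population_sd(consistent))  # 0.0
```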

Identifying Outliers

An outlier is a piece of data that doesn't fit the pattern. Rather than guessing, we use one of two rules to decide whether a value officially counts as an outlier:
1. The IQR Rule: A value is an outlier if it is:
Lower than \( LQ - (1.5 \times IQR) \) OR Higher than \( UQ + (1.5 \times IQR) \)

2. The Standard Deviation Rule: Any value outside \( \mu \pm 3\sigma \) (more than 3 standard deviations from the mean) is usually an outlier.
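
Both outlier rules can be sketched as simple checks. The quartiles, mean and standard deviation are passed in as already-calculated numbers, and all the values below are hypothetical.

```python
def iqr_fences(lq, uq):
    """Return the (lower, upper) boundaries from the IQR rule."""
    iqr = uq - lq
    return lq - 1.5 * iqr, uq + 1.5 * iqr

def is_outlier_iqr(x, lq, uq):
    """True if x lies beyond LQ - 1.5*IQR or UQ + 1.5*IQR."""
    lower, upper = iqr_fences(lq, uq)
    return x < lower or x > upper

def is_outlier_sd(x, mean, sd):
    """True if x is more than 3 standard deviations from the mean."""
    return abs(x - mean) > 3 * sd

# Hypothetical summary values: LQ = 20, UQ = 30, so IQR = 10
print(iqr_fences(20, 30))          # (5.0, 45.0)
print(is_outlier_iqr(48, 20, 30))  # True
print(is_outlier_iqr(40, 20, 30))  # False

# Hypothetical mean 50 and standard deviation 4, so limits are 38 and 62
print(is_outlier_sd(63, 50, 4))    # True
print(is_outlier_sd(60, 50, 4))    # False
```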

Standardized Scores (Z-Scores)

How do you compare a score in a hard Maths test to a score in an easy English test? You use Z-scores! A Z-score tells you how many standard deviations a value is away from the mean.
\( \text{Standardized Score} = \frac{x - \mu}{\sigma} \)

Key Takeaway: A positive Z-score is above average; a negative Z-score is below average. A Z-score of 0 is exactly average.
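
Here is the Maths-versus-English comparison as a quick sketch. The marks, means and standard deviations are all invented for illustration.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# Hypothetical results:
# Maths (hard test): scored 62, class mean 50, standard deviation 8
# English (easy test): scored 75, class mean 70, standard deviation 10
maths = z_score(62, 50, 8)
english = z_score(75, 70, 10)

print(maths)    # 1.5
print(english)  # 0.5
```

Even though 75 is the bigger raw mark, the Maths score is the stronger performance relative to the rest of the class.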

4. Correlation and Regression

This is all about the relationship between two variables (bivariate data).

Spearman’s Rank vs. PMCC

  • PMCC (Pearson’s): Measures the strength of a linear (straight line) relationship. Value is between -1 and +1.
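  • Spearman’s Rank: Measures how well the ranks match up. Value is also between -1 and +1. Use this when the points do not follow a straight line but still consistently rise (or consistently fall) together, i.e. a non-linear relationship that goes in one direction.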
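
The difference between the two measures shows up clearly on curved data. This sketch uses the standard textbook formulas (for Spearman, \( 1 - \frac{6\sum d^2}{n(n^2-1)} \), which assumes no tied ranks). The helper names and the data are chosen here for illustration.

```python
import math

def pmcc(xs, ys):
    """Pearson's product moment correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Rank 1 = smallest value. Assumes there are no tied values."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman(xs, ys):
    """Spearman's rank correlation using 1 - 6*sum(d^2) / (n(n^2 - 1))."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical curved data: y always increases with x, but not in a straight line
xs = [1, 2, 3, 4, 5]
ys = [1, 8, 27, 64, 125]

print(spearman(xs, ys))  # 1.0, because the ranks match perfectly
print(pmcc(xs, ys))      # below 1, because the points are not on a straight line
```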

The Regression Line

The "Line of Best Fit" has an equation: \( y = a + bx \).
- \( a \) is the intercept (where it hits the y-axis).
- \( b \) is the gradient (how much \( y \) changes for every 1 unit of \( x \)).

Common Mistake to Avoid: Extrapolation. This is predicting data outside the range you measured. It is very risky because the trend might change!
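
In an exam you are usually given or asked to read off the line, but it can help to see where \( a \) and \( b \) come from. This sketch uses the standard least-squares formulas, which are not stated in these notes, and adds a hypothetical `predict` helper that flags extrapolation. The data is invented.

```python
def regression_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + bx."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx            # gradient
    a = mean_y - b * mean_x  # intercept
    return a, b

def predict(x, a, b, xs):
    """Return (predicted y, True if x is outside the measured range)."""
    is_extrapolation = not (min(xs) <= x <= max(xs))
    return a + b * x, is_extrapolation

# Hypothetical data lying exactly on y = 1 + 2x
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
a, b = regression_line(xs, ys)

print(a, b)                     # 1.0 2.0
print(predict(2.5, a, b, xs))   # (6.0, False) -> inside the range, reasonable
print(predict(10, a, b, xs))    # (21.0, True) -> extrapolation, treat with caution
```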

Key Takeaway: Correlation does not equal Causation! Just because ice cream sales and shark attacks both go up in summer doesn't mean ice cream causes shark attacks. They are both linked to a third factor: warm weather.

5. Time Series and Quality Assurance

Stats isn't just a snapshot; it's often a movie showing changes over time.

Moving Averages

Data like "daily temperature" jumps up and down a lot (this is called "noise"). A moving average smooths out these bumps to show the underlying trend. For example, a 4-point moving average replaces each run of 4 consecutive values with their mean.
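
A 4-point moving average can be sketched like this. The figures are made up, and `moving_average` is simply a helper name chosen here.

```python
def moving_average(values, points=4):
    """Mean of each run of `points` consecutive values."""
    return [
        sum(values[i:i + points]) / points
        for i in range(len(values) - points + 1)
    ]

# Hypothetical readings over six time periods
readings = [10, 20, 30, 40, 50, 60]
print(moving_average(readings))  # [25.0, 35.0, 45.0]
```

Six readings give only three 4-point averages, because each average needs four values to work with.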

Quality Control Charts

Factories use these to make sure machines aren't breaking.
- Warning Lines: Usually at \( \pm 2\sigma \) from the target value. If a point hits this, you keep a close eye on it.
- Action Lines: Usually at \( \pm 3\sigma \) from the target value. If a point hits this, stop the machine! Something is wrong.

Did you know? When a process is running normally, only about 1 in every 20 points should fall outside the warning lines by pure chance.
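
The warning and action lines can be sketched as a simple classifier. This assumes the lines are measured from a target value and that "hitting" a line means being on or beyond it; the target, standard deviation and sample values are all hypothetical.

```python
def classify_point(value, target, sd):
    """Return 'action', 'warning' or 'ok' for a plotted point.

    Assumption: a point exactly on a line counts as hitting it.
    """
    distance = abs(value - target)
    if distance >= 3 * sd:
        return "action"   # stop the machine
    if distance >= 2 * sd:
        return "warning"  # keep a close eye on it
    return "ok"

# Hypothetical machine: target 500 g, standard deviation 2 g
# Warning lines at 496 and 504, action lines at 494 and 506
for sample in [503, 504.5, 507, 495]:
    print(sample, classify_point(sample, 500, 2))
# 503 ok
# 504.5 warning
# 507 action
# 495 warning
```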

6. Estimation: The Petersen Capture-Recapture

How do you count fish in a lake without catching them all?
1. Catch a group, mark them (\( M \)), and release them.
2. Later, catch a second group (\( n \)).
3. Count how many in the second group are marked (\( m \)).

The Formula:
\( \text{Total Population (N)} = \frac{M \times n}{m} \)

Assumptions you must know:
- The marks didn't fall off.
- No animals were born, died, or moved in or out of the area between catches.
- The marked animals mixed back in completely.
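
Here is the capture-recapture formula as a short sketch. The fish counts are invented, and the parameter names are chosen here to mirror \( M \), \( n \) and \( m \).

```python
def petersen_estimate(marked_first, second_catch, marked_in_second):
    """Estimate the total population N = (M x n) / m."""
    if marked_in_second == 0:
        raise ValueError("No marked animals were recaptured, so N cannot be estimated.")
    return marked_first * second_catch / marked_in_second

# Hypothetical lake: 50 fish marked (M), 40 caught later (n), 8 of those marked (m)
print(petersen_estimate(50, 40, 8))  # 250.0
```

The logic is proportional reasoning: 8 out of 40 (one fifth) of the second catch were marked, so the 50 marked fish should be about one fifth of the whole lake.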

Final Encouragement: Statistics is about telling a story with numbers. Don't let the symbols scare you—they are just a shorthand for these simple ideas. You've got this!