Welcome to Correlation and Linear Regression!

Ever wondered if there’s a genuine link between the hours you spend gaming and your reaction speeds? Or if the temperature outside actually predicts how much ice cream a shop will sell? That is exactly what this chapter is about! We are looking for relationships between two different sets of data and learning how to use those relationships to make smart predictions.

In Paper 1, the focus is on calculating these values using your calculator and, more importantly, interpreting what they mean in the real world. Don't worry if the formulas look scary in the textbook—for this exam, your calculator does the heavy lifting!


1. Understanding Correlation

Correlation describes the strength and direction of a relationship between two variables.

Pearson’s Product Moment Correlation Coefficient (PMCC or \(r\))

This is a measure of how close the points on a scatter graph are to a straight line.

  • Value of \(r\): It always falls between -1 and 1.
  • \(r = 1\): Perfect positive linear correlation (a perfect upward straight line).
  • \(r = -1\): Perfect negative linear correlation (a perfect downward straight line).
  • \(r = 0\): No linear correlation at all.

Quick Review: If \(r = 0.9\), the points are very close to a straight line going up. If \(r = 0.2\), they are very spread out but generally moving upwards.
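If you are curious what your calculator is doing behind the scenes, here is a minimal Python sketch. It assumes the standard definition \( r = S_{xy} / \sqrt{S_{xx} S_{yy}} \); the function name `pearson_r` and the gaming data are made up for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's product moment correlation coefficient for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sxy, Sxx, Syy: sums of products of deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: hours spent gaming and a reaction-test score
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]

r = pearson_r(hours, scores)
print(round(r, 3))  # close to 1: strong positive linear correlation
```

In the exam you would still read \(r\) straight off your calculator; this is only to show where the number comes from.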

Spearman’s Rank Correlation Coefficient (\(\rho\))

Sometimes the data doesn't follow a straight line, but it still follows a consistent trend (e.g., as one variable goes up, the other goes up, but along a curve). For this, we use Spearman’s Rank. Instead of using the raw numbers, we rank them (1st, 2nd, 3rd...).

How to Rank for Spearman's:

  1. Rank the first variable (e.g., 1 for the highest score, 2 for the second highest).
  2. Rank the second variable in the same way.
  3. Tied Ranks: If two items are tied for 3rd and 4th place, give them both the average rank: \( (3+4) \div 2 = 3.5 \).
  4. Consistency is Key: If you rank the highest as "1" for the first variable, you must rank the highest as "1" for the second variable too!
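The ranking steps above, including the tied-rank rule, can be sketched in Python. This is just one possible way to do it; the helper name `average_ranks` and the scores are invented for the example.

```python
def average_ranks(values):
    """Rank values with 1 for the highest; tied values share the average rank."""
    ordered = sorted(values, reverse=True)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1         # first position this value occupies
        last = first + ordered.count(v) - 1  # last position this value occupies
        ranks.append((first + last) / 2)     # average of the tied positions
    return ranks

scores = [90, 85, 70, 70, 60]  # hypothetical test scores
print(average_ranks(scores))   # [1.0, 2.0, 3.5, 3.5, 5.0]
```

The two scores of 70 would have taken 3rd and 4th place, so each gets rank 3.5.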

Did you know? Spearman’s Rank is great for "subjective" data, like two judges ranking contestants in a talent show. It doesn't matter if one judge gives 90/100 and the other gives 70/100; if they both put all the contestants in the same order, Spearman's will show a perfect link (\(\rho = 1\))!

Key Takeaway:

Pearson’s (\(r\)) measures linear (straight line) relationships. Spearman’s (\(\rho\)) measures agreement in rank (general trends) and is more flexible.
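To see the "agreement in rank" idea in action, here is a rough Python sketch of the talent show judges. It assumes there are no tied scores, which lets it use the shortcut formula \( \rho = 1 - \dfrac{6\sum d^2}{n(n^2 - 1)} \), where \(d\) is the difference between the two ranks for each contestant. The function names and the judges' scores are hypothetical.

```python
def ranks(values):
    """Rank with 1 for the highest value (assumes no ties)."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

def spearman_no_ties(xs, ys):
    """Spearman's rank coefficient via 1 - 6*sum(d^2) / (n(n^2 - 1)).

    Only valid when neither list contains tied values.
    """
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical scores for five contestants
judge_a = [90, 82, 75, 68, 60]
judge_b = [70, 66, 59, 51, 40]  # lower marks, same order as judge A
judge_c = [66, 70, 59, 51, 40]  # swaps the top two contestants

print(spearman_no_ties(judge_a, judge_b))  # 1.0
print(spearman_no_ties(judge_a, judge_c))  # 0.9
```

Judges A and B give very different marks but identical rankings, so the coefficient is 1. Swapping just two places pulls it slightly below 1.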


2. Choosing the Right Method

In the exam, you might be asked why you chose a specific method. Here is how to decide:

Use Pearson’s (\(r\)) when:

  • The relationship looks linear (a straight line) on a scatter graph.
  • The data comes from a bivariate normal distribution (this is a fancy way of saying the data is roughly "bell-shaped" when plotted in 3D; on a scatter graph, the points form a rough ellipse that is densest in the middle).

Use Spearman’s (\(\rho\)) when:

  • The data is non-linear but still consistently increasing or decreasing (e.g., it follows a curve that keeps going up).
  • The data is already in ranks or is qualitative (like "tasty," "tastier," "tastiest").
  • There are outliers. Spearman’s is much less affected by one weird result than Pearson’s is.
  • You cannot assume the data is normally distributed. Spearman’s makes no assumptions about the distribution of the data.

Common Mistake: Students often forget that Pearson's only measures straight lines. If the data forms a perfect "U" shape, Pearson’s \(r\) might be 0, even though there is clearly a relationship!
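A quick Python sketch makes this concrete. It uses a made-up, perfectly symmetric "U" shape (\(y = x^2\)) and a hand-written `pearson_r` helper based on the standard formula \( r = S_{xy} / \sqrt{S_{xx} S_{yy}} \).

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r from the sums of deviations Sxy, Sxx and Syy."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# A perfect, symmetric "U" shape: y is completely determined by x
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r = pearson_r(xs, ys)
print(r)  # zero: no LINEAR correlation, despite the obvious pattern
```

The left half of the "U" pulls \(r\) negative and the right half pulls it positive by the same amount, so they cancel out.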


3. Linear Regression: The Line of Best Fit

While correlation tells us if there is a link, regression gives us the equation of the line so we can make predictions. The standard form is:

\( y = a + bx \)

  • \(y\): The dependent variable (the thing you are trying to predict).
  • \(x\): The independent (or explanatory) variable (the thing you already know).
  • \(a\): The intercept. This is the value of \(y\) when \(x = 0\).
  • \(b\): The gradient. This tells you how much \(y\) increases (or decreases if \(b\) is negative) for every 1 unit increase in \(x\).

Step-by-Step Interpretation:
If an equation for ice cream sales (\(y\)) and temperature (\(x\)) is \( y = 20 + 5x \):
1. The intercept (20) means if the temperature is 0°C, we expect to sell 20 ice creams.
2. The gradient (5) means for every 1°C increase in temperature, we expect to sell 5 more ice creams.
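The same interpretation can be checked with a tiny Python sketch of the example line \( y = 20 + 5x \). The function name `predict_sales` is just for illustration.

```python
def predict_sales(temperature_c, a=20, b=5):
    """Predicted ice creams sold from the line y = a + b*x."""
    return a + b * temperature_c

print(predict_sales(0))                        # 20  (the intercept)
print(predict_sales(25))                       # 145
print(predict_sales(26) - predict_sales(25))   # 5   (the gradient)
```

Raising the temperature by 1°C always adds exactly \(b = 5\) to the prediction, whatever temperature you start from.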

Don't worry: You are expected to use your calculator to find \(a\) and \(b\). Make sure you know how to enter two-variable data (often under the 'STAT' or '6' menu on standard A-Level calculators).
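For the curious, here is a minimal Python sketch of the usual least-squares calculation for the regression line of \(y\) on \(x\), assuming \( b = S_{xy} / S_{xx} \) and \( a = \bar{y} - b\bar{x} \). The function name `regression_line` and the data are hypothetical; it is not a substitute for knowing your calculator's buttons.

```python
def regression_line(xs, ys):
    """Least-squares regression line of y on x; returns (a, b) for y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx            # gradient
    a = mean_y - b * mean_x  # intercept: the line passes through (mean_x, mean_y)
    return a, b

# Hypothetical data: temperature (°C) and ice creams sold
temps = [10, 15, 20, 25, 30]
sales = [72, 93, 121, 146, 168]

a, b = regression_line(temps, sales)
print(round(a, 2), round(b, 2))  # 22.0 4.9
```

So for this made-up data the line would be roughly \( y = 22 + 4.9x \).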


4. Predictions: Safe vs. Risky

Once you have your \( y = a + bx \) line, you can plug in an \(x\) value to find a \(y\) value. But be careful!

Interpolation (The Safe Bet)

This is when you predict a value within the range of data you already have. If you measured temperatures between 10°C and 30°C, predicting for 20°C is interpolation. It is usually quite reliable.

Extrapolation (The Danger Zone)

This is when you predict a value outside your data range. Predicting ice cream sales at 50°C when your highest data point was 30°C is extrapolation.
Why is it risky? Because the trend might not continue! At 50°C, people might stay inside, and sales could actually drop. Avoid relying on extrapolation.

Key Takeaway:

Interpolation = inside the data range (Reliable).
Extrapolation = outside the data range (Unreliable/Risky).
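As a quick sanity check before trusting a prediction, you can compare the new \(x\) value with the range of your data. Here is a small Python sketch; the function name `classify_prediction` and the temperature readings are invented for illustration.

```python
def classify_prediction(x_new, observed_xs):
    """Label a prediction by whether x_new lies inside the observed x range."""
    if min(observed_xs) <= x_new <= max(observed_xs):
        return "interpolation"
    return "extrapolation"

# Hypothetical temperatures (°C) at which sales were recorded
temperatures = [10, 14, 19, 23, 27, 30]

print(classify_prediction(20, temperatures))  # interpolation
print(classify_prediction(50, temperatures))  # extrapolation
```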


5. Residuals and Outliers

A residual is simply the difference between what actually happened and what your regression line predicted.

\( \text{Residual} = y_i - (a + bx_i) \)

In simple terms: Residual = Actual value - Predicted value.

  • If the residual is positive, the actual data point is above the line.
  • If the residual is negative, the actual data point is below the line.
  • A very large residual (positive or negative) suggests that the data point might be an outlier.
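Here is a short Python sketch of the residual calculation. It reuses the example line \( y = 20 + 5x \) from earlier as an assumed regression line (it has not been fitted to this data), and the sales figures are made up so that one day clearly stands out.

```python
def residuals(xs, ys, a, b):
    """Residual = actual y minus predicted y, for each data point."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]

# Hypothetical data: temperature (°C) and ice creams sold
temps = [10, 15, 20, 25, 30]
sales = [72, 93, 121, 146, 110]  # the last day looks unusual

res = residuals(temps, sales, a=20, b=5)
print(res)  # [2, -2, 1, 1, -60]

# The point with the largest residual (ignoring sign) is the outlier candidate
worst = max(range(len(res)), key=lambda i: abs(res[i]))
print(temps[worst], sales[worst])  # 30 110
```

Four of the days sit within 2 ice creams of the line, so the residual of -60 marks the 30°C day as a possible outlier, lying well below the line.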

Analogy: Imagine your GPS says your journey will take 30 minutes (Predicted), but it actually takes 45 minutes (Actual). Your "residual" is 15 minutes. If every other journey is within 1 minute of the prediction, this 15-minute gap makes that journey an outlier!

Quick Review Box:
- Pearson’s \(r\): Linear, range -1 to 1.
- Spearman’s \(\rho\): Ranks, non-linear trends.
- Equation \(y = a + bx\): \(a\) is the intercept (the value of \(y\) when \(x = 0\)), \(b\) is the gradient (the change in \(y\) for each 1 unit increase in \(x\)).
- Residual: Actual minus Predicted.


Congratulations! You've covered the core of Correlation and Regression for Paper 1. Remember: always interpret your answers in the context of the question (use units such as "kg," "metres," or "£") to get those top marks!