Welcome to Correlation and Regression!
Ever wondered if there’s a link between how much you study and the grades you get? Or if taller people generally have bigger feet? In this chapter of Unit S1, we’ll learn how to measure these relationships mathematically. We are moving from looking at just one set of numbers to looking at bivariate data—which is just a fancy way of saying "data with two variables."
Don’t worry if you’re not a "maths person" yet. We’ll break this down into simple steps, from drawing dots on a graph to predicting the future with equations!
1. The Basics: Variables and Scatter Diagrams
Before we calculate anything, we need to know what we are looking at. When we have two variables, we usually label them x and y.
Explanatory vs. Response Variables
• Explanatory Variable (x): This is the independent variable. It’s the one we think might be causing a change. We always plot this on the horizontal axis.
• Response Variable (y): This is the dependent variable. It’s the one we are measuring to see how it reacts to x. We plot this on the vertical axis.
Example: If you are studying how sunlight affects plant growth, "Sunlight" is the explanatory variable (x) and "Height of the plant" is the response variable (y).
Scatter Diagrams
A scatter diagram is simply a graph where each pair of data values \((x, y)\) is plotted as a single dot. It helps us see the "shape" of the relationship.
• Positive Correlation: The dots go "up" from left to right. As x increases, y increases.
• Negative Correlation: The dots go "down" from left to right. As x increases, y decreases.
• No Correlation: The dots are scattered everywhere like a cloud of flies. There is no clear link.
Quick Review Box:
Always check your axes! x explains, y responds.
Summary Takeaway: Scatter diagrams are our first look at data. They show us the direction (positive or negative) and the strength of a relationship visually.
2. Measuring Correlation: The PMCC (r)
Visuals are great, but mathematicians want a number. That number is the Product Moment Correlation Coefficient, or r for short.
What does 'r' tell us?
The value of r always sits between -1 and +1.
• \(r = +1\): Perfect positive linear correlation (all points are on a straight line pointing up).
• \(r = -1\): Perfect negative linear correlation (all points are on a straight line pointing down).
• \(r = 0\): No linear correlation at all (although there could still be a non-linear relationship).
• The closer r is to +1 or -1, the stronger the linear relationship.
Memory Aid: Think of r as the "Strength of the Bond."
0.9 is a "Best Friend" (Strong), 0.3 is an "Acquaintance" (Weak), and 0 is a "Stranger" (No link).
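To see where the number comes from, here is a minimal Python sketch. It assumes the standard formula \(r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}\), which these notes do not state explicitly. The data (hours studied and test scores) and the function name `pmcc` are made up for illustration.

```python
from math import sqrt

def pmcc(xs, ys):
    """Product moment correlation coefficient from raw data (hypothetical helper)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    # Summary statistics, assuming the usual S_xx, S_yy and S_xy definitions
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    return s_xy / sqrt(s_xx * s_yy)

# Made-up data: hours studied (x) and test score (y)
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 5, 4, 5]

r = pmcc(hours, scores)
print(round(r, 3))  # 0.775
```

Here \(S_{xx} = 10\), \(S_{yy} = 6\) and \(S_{xy} = 6\), so \(r = 6 / \sqrt{60} \approx 0.775\): a fairly strong positive linear correlation.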
Common Pitfall: Correlation vs. Causation
Did you know? Just because two things have a high r value doesn't mean one causes the other. For example, ice cream sales and shark attacks both increase in the summer. They are correlated, but eating ice cream does not cause shark attacks! They both respond to a third factor: warm weather.
Summary Takeaway: The PMCC (r) measures the strength and direction of a linear (straight-line) relationship. It does not prove that one thing causes another!
3. Linear Regression: The "Line of Best Fit"
If there is a linear correlation, we can draw a straight line through the data. In S1, we use the Least Squares Regression Line. The equation looks like this:
\(y = a + bx\)
What do 'a' and 'b' mean?
• b (The Gradient): This tells us how much y changes for every 1 unit increase in x. If b is 2, then every time x goes up by 1, y goes up by 2.
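• a (The Intercept): This is the value of y that the line predicts when \(x = 0\). In real-world terms, it’s the "starting value," although this interpretation only makes sense if \(x = 0\) is a realistic value within (or close to) the range of your data.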
The Method of Least Squares
You don't need to derive the formulas, but you need to know how to use them from the formula booklet. You will often calculate summary statistics first:
\(S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}\)
\(S_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n}\)
Then:
\(b = \frac{S_{xy}}{S_{xx}}\)
\(a = \bar{y} - b\bar{x}\)
(Where \(\bar{x}\) and \(\bar{y}\) are the mean values of x and y).
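The formulas above can be sketched in a few lines of Python. This is only an illustration: the data set is made up and `regression_line` is a hypothetical helper name, not something from the formula booklet.

```python
def regression_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    b = s_xy / s_xx                      # gradient
    a = sum_y / n - b * (sum_x / n)      # intercept: y-bar minus b times x-bar
    return a, b

# Made-up data: hours studied (x) and test score (y)
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 5, 4, 5]

a, b = regression_line(hours, scores)
print(f"y = {a:.2f} + {b:.2f}x")  # y = 2.20 + 0.60x
```

By hand: \(S_{xx} = 55 - \frac{15^2}{5} = 10\) and \(S_{xy} = 66 - \frac{15 \times 20}{5} = 6\), so \(b = 0.6\) and \(a = 4 - 0.6 \times 3 = 2.2\).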
Step-by-Step: Drawing the Regression Line
1. Find the mean point \((\bar{x}, \bar{y})\). The line always passes through this point.
2. Pick a second value of x (for example \(x = 0\), which gives \(y = a\)) and use the equation to calculate the corresponding y value.
3. Plot these two points and join them with a ruler.
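The steps above can be sketched as follows. This is a rough illustration only: it assumes a line \(y = 2.2 + 0.6x\) fitted to made-up x values 1 to 5, and `points_for_line` is a hypothetical helper. It picks the largest x value as the second point instead of \(x = 0\), so that both points lie inside the data range.

```python
def points_for_line(a, b, xs):
    """Return two points on y = a + b*x that can be plotted and joined."""
    x_bar = sum(xs) / len(xs)
    # When a and b come from these data, a + b*x_bar equals y-bar,
    # so this is the mean point.
    mean_point = (x_bar, a + b * x_bar)
    # A second x value, chosen here as the largest observed x
    x_end = max(xs)
    end_point = (x_end, a + b * x_end)
    return mean_point, end_point

hours = [1, 2, 3, 4, 5]  # made-up x values
mean_point, end_point = points_for_line(2.2, 0.6, hours)
print(mean_point, end_point)
```

Choosing two points that are far apart makes it easier to draw the line accurately with a ruler.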
Summary Takeaway: The regression line \(y = a + bx\) is a mathematical model used to predict the value of a response variable (y) based on an explanatory variable (x).
4. Making Predictions: Interpolation vs. Extrapolation
The whole point of the regression line is to predict things. But we have to be careful!
Interpolation (The Safe Zone)
This is when you use the line to predict a y value for an x that is inside the range of data you already have. This is usually reliable, as long as the correlation is reasonably strong.
Extrapolation (The Danger Zone)
This is when you try to predict a y value for an x that is outside your data range.
Analogy: If you measure a baby's growth from age 0 to 1, and use that line to predict their height at age 50, your line might say they will be 10 meters tall!
Common Mistake: Students often trust extrapolation. In exams, if you are asked if a prediction is reliable and the x value is outside the data range, always say: "No, this is extrapolation and may not be reliable."
Quick Review Box:
• Inside data range = Interpolation = Usually reliable.
• Outside data range = Extrapolation = May be unreliable.
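As a small sketch of this check in Python, the function below makes a prediction and labels it as interpolation or extrapolation. The line \(y = 2.2 + 0.6x\), the data, and the name `predict` are all assumptions made up for the example.

```python
def predict(a, b, x_new, x_observed):
    """Predict y from y = a + b*x and say whether x_new is inside the data range."""
    y_hat = a + b * x_new
    in_range = min(x_observed) <= x_new <= max(x_observed)
    kind = "interpolation" if in_range else "extrapolation"
    return y_hat, kind

hours = [1, 2, 3, 4, 5]  # made-up x values used to fit the line

inside_pred = predict(2.2, 0.6, 3.5, hours)   # x = 3.5 is between 1 and 5
outside_pred = predict(2.2, 0.6, 10, hours)   # x = 10 is beyond the data
print(inside_pred[1], outside_pred[1])  # interpolation extrapolation
```

The arithmetic works the same way in both cases; only the label tells you how much to trust the answer.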
Summary Takeaway: Only trust your regression model for x values within the range of your original data; outside that range, the linear pattern may no longer hold.
5. Coding (Change of Variable)
Sometimes the numbers are huge or have too many decimals. We use coding to make them simpler (e.g., \(p = x - 100\)).
• Linear coding (adding or subtracting a constant, or multiplying or dividing by a positive constant) does NOT change the PMCC (r). The strength of the relationship stays the same regardless of the units.
• If you have a regression line for coded data, remember to substitute back to get the final answer in terms of the original variables.
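Both points can be checked with a short Python sketch. It uses made-up data, the coding \(p = x - 100\) from the example above, and a hypothetical helper `summary`. It also assumes the standard formula \(r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}\).

```python
from math import sqrt

def summary(xs, ys):
    """Return (S_xx, S_yy, S_xy) for paired data."""
    n = len(xs)
    s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    s_yy = sum(y * y for y in ys) - sum(ys) ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    return s_xx, s_yy, s_xy

x = [101, 102, 103, 104, 105]   # made-up original data
y = [2, 4, 5, 4, 5]
p = [xi - 100 for xi in x]      # coded data: p = x - 100

# r from the original data and from the coded data
s_xx, s_yy, s_xy = summary(x, y)
r_original = s_xy / sqrt(s_xx * s_yy)
s_pp, _, s_py = summary(p, y)
r_coded = s_py / sqrt(s_pp * s_yy)

# Regression line for the coded data: y = a_coded + b_coded * p
b_coded = s_py / s_pp
a_coded = sum(y) / len(y) - b_coded * sum(p) / len(p)

# Substitute p = x - 100 back in:
# y = a_coded + b_coded * (x - 100) = (a_coded - 100 * b_coded) + b_coded * x
a_original = a_coded - 100 * b_coded
b_original = b_coded

print(round(r_original, 3), round(r_coded, 3))          # 0.775 0.775
print(f"y = {a_original:.1f} + {b_original:.1f}x")      # y = -57.8 + 0.6x
```

The coded line is \(y = 2.2 + 0.6p\); substituting \(p = x - 100\) gives \(y = -57.8 + 0.6x\). The intercept changes, while r stays the same.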
Summary Takeaway: Coding is just a shortcut for calculation. It can change the regression equation's 'a' and 'b' values (a simple shift such as \(p = x - 100\) leaves b unchanged), but linear coding of this kind never changes the correlation coefficient r.
Final Tips for Success
• Don't Panic: If the formulas look scary, remember they are all in the provided Formula Booklet. You just need to know which values to plug in.
• Check your signs: A negative \(S_{xy}\) means a negative correlation. If your r is negative but your b is positive, you’ve made a calculation error!
• Context is key: Always mention the real-world variables (e.g., "weight" and "height") in your final interpretation, not just "x" and "y".