Welcome to Correlation and Regression!
In this chapter, we are going to explore how two different sets of data might be related to each other. For example, does spending more time on revision lead to higher exam marks? Or does the temperature outside affect how many ice creams are sold? Correlation helps us measure the strength of these relationships, and Regression allows us to predict one value if we know the other. Don't worry if these terms sound fancy—by the end of these notes, you'll be using them like a pro!
1. Scatter Diagrams and Variables
Before we do any math, we usually draw a picture of our data. This is called a Scatter Diagram. It helps us see if there is a pattern.
Explanatory and Response Variables
To draw a scatter diagram, we need to decide which variable goes on which axis:
- Explanatory Variable (Independent): This is the variable that "explains" the change. We plot this on the \(x\)-axis. Think of this as the "input."
- Response Variable (Dependent): This is the variable that "responds" to the change. We plot this on the \(y\)-axis. Think of this as the "output" or the result.
Example: If you are investigating how "Hours of Sunshine" affects "Ice Cream Sales," the sunshine is the explanatory variable (\(x\)) and the sales are the response variable (\(y\)).
Quick Review: Remember the phrase "\(x\) explains \(y\)" to help you remember which goes where!
2. Correlation: Measuring the Relationship
Correlation tells us two things: the direction and the strength of a linear relationship between two variables.
The Product Moment Correlation Coefficient (PMCC)
The PMCC is a value between \(-1\) and \(+1\), represented by the letter \(r\), that tells us exactly how strong the correlation is. You don't need to derive the formula for the exam, but you do need to know how to interpret the result.
- \(r = +1\): Perfect positive linear correlation (all points are in a perfectly straight line going up).
- \(r = -1\): Perfect negative linear correlation (all points are in a perfectly straight line going down).
- \(r = 0\): No linear correlation at all (the points look like a random cloud).
Did you know? The PMCC only measures linear relationships (straight lines). If your data follows a curve like a "U" shape, the PMCC might be 0, even though there is clearly a pattern!
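We can check both of these claims with a short sketch in plain Python (the `pmcc` helper below is our own, not from any library; it uses the standard deviation-from-the-mean formula \(r = S_{xy}/\sqrt{S_{xx}S_{yy}}\)):

```python
from math import sqrt

def pmcc(xs, ys):
    # r = Sxy / sqrt(Sxx * Syy), using deviations from the means
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# A perfect straight line going up gives r = 1
print(pmcc([1, 2, 3, 4], [3, 5, 7, 9]))          # 1.0

# A symmetric "U" shape (y = x squared) gives r = 0,
# even though there is clearly a pattern
print(pmcc([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4]))  # 0.0
```

This is exactly the "Did you know?" warning in action: the second data set has an obvious pattern, but because it is not a *straight-line* pattern, the PMCC reports 0.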
Interpreting the Strength
In the exam, you'll often be asked to describe the correlation. Use these "strength" words:
- 0.7 to 1.0: Strong positive correlation.
- 0.3 to 0.7: Weak to moderate positive correlation.
- -0.3 to 0.3: Little or no linear correlation.
- -0.7 to -0.3: Weak to moderate negative correlation.
- -1.0 to -0.7: Strong negative correlation.
Common Mistake to Avoid: Correlation is NOT Causation! Just because two things are correlated doesn't mean one causes the other. For example, ice cream sales and shark attacks are both correlated (because they both increase in summer), but eating ice cream does not cause shark attacks!
3. Linear Regression: The Line of Best Fit
If there is a linear correlation, we can draw a Regression Line. In Statistics 1, we use the Least Squares Regression Line. This is the line that makes the sum of the squares of the vertical distances between the points and the line as small as possible.
The Equation of the Line
The equation is written as: \(y = a + bx\)
- \(b\): The gradient (how much \(y\) changes for every 1 unit increase in \(x\)—if \(b\) is negative, \(y\) decreases).
- \(a\): The intercept (where the line crosses the \(y\)-axis).
How to calculate \(a\) and \(b\)
You will use summary statistics like \(S_{xx}\) and \(S_{xy}\) which are provided in your formula booklet. The steps are usually:
- Calculate the gradient: \(b = \frac{S_{xy}}{S_{xx}}\)
- Calculate the intercept: \(a = \bar{y} - b\bar{x}\) (where \(\bar{x}\) and \(\bar{y}\) are the means of the data).
Key Point: The regression line always passes through the point of the means \((\bar{x}, \bar{y})\). This is a great way to check if you have drawn your line correctly on a scatter diagram!
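The two steps above can be sketched in a few lines of Python (the summary statistics here are made up purely for illustration):

```python
def regression_line(s_xy, s_xx, x_bar, y_bar):
    # Step 1: gradient b = Sxy / Sxx
    b = s_xy / s_xx
    # Step 2: intercept a = y_bar - b * x_bar
    a = y_bar - b * x_bar
    return a, b

# Invented summary statistics: Sxy = 60, Sxx = 30, means (5, 12)
a, b = regression_line(60, 30, 5, 12)
print(f"y = {a} + {b}x")  # y = 2.0 + 2.0x

# Key Point check: the line passes through the mean point (5, 12)
print(a + b * 5)          # 12.0
```

Notice how the final print confirms the Key Point: substituting \(\bar{x}\) into the equation returns exactly \(\bar{y}\), which follows directly from how \(a\) is defined.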
4. Using the Line for Predictions
The whole point of finding the equation \(y = a + bx\) is so we can predict a value of \(y\) for a given value of \(x\).
Interpolation vs. Extrapolation
This is a very common exam topic! You need to know if your prediction is reliable.
- Interpolation: Making a prediction within the range of the data you already have. This is usually reliable.
- Extrapolation: Making a prediction outside the range of your original data (e.g., if your data is for temperatures between 10°C and 20°C, predicting for 40°C is extrapolation). This is unreliable because we don't know if the trend continues.
Analogy: Imagine you are watching a seedling grow 1cm every day for a week. Interpolation is guessing how tall it was on day 4 (Safe). Extrapolation is guessing how tall it will be in 10 years (Risky—it will eventually stop growing!).
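A simple way to keep yourself honest is to check the prediction against the range of the original data. Here is a minimal sketch (the `predict` helper and the ice-cream line are our own inventions for illustration, not standard exam notation):

```python
def predict(x, a, b, x_min, x_max):
    # Evaluate y = a + bx, and flag whether x lies inside the data range
    y = a + b * x
    if x_min <= x <= x_max:
        return y, "interpolation (reliable)"
    return y, "extrapolation (unreliable)"

# Made-up line: ice creams sold = 50 + 3 * temperature,
# fitted to data collected between 10 and 20 degrees C
print(predict(15, 50, 3, 10, 20))  # (95, 'interpolation (reliable)')
print(predict(40, 50, 3, 10, 20))  # (170, 'extrapolation (unreliable)')
```

The second prediction still produces a number, which is exactly the danger: the formula never refuses to answer, so it is up to you to notice that 40°C is far outside the data.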
5. Coding (Change of Variable)
Sometimes, data is "coded" to make it easier to handle (e.g., subtracting a constant or dividing by a number). You need to know how this affects your results.
- PMCC (\(r\)): Coding does not change the PMCC. If the relationship is strong, it stays strong regardless of the units!
- Regression Line: Coding does change the equation. If you calculate a regression line using coded data, you must substitute the coding formulas back in to get the final relationship for the original variables.
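Both points can be seen in a small numerical sketch (the `pmcc` helper, the data, and the coding \(p = (x-100)/10\), \(q = 10y\) are all ours, chosen just for illustration):

```python
from math import sqrt

def pmcc(xs, ys):
    # Product moment correlation coefficient: r = Sxy / sqrt(Sxx * Syy)
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

xs = [100, 110, 120, 130]
ys = [2.1, 2.5, 2.8, 3.4]

# Code the data: p = (x - 100) / 10 and q = 10 * y
ps = [(x - 100) / 10 for x in xs]
qs = [10 * y for y in ys]

# The PMCC is unchanged by the coding (up to tiny rounding error)
print(abs(pmcc(xs, ys) - pmcc(ps, qs)) < 1e-12)  # True

# Decoding a regression line: if the coded fit were q = 1 + 2p,
# substituting q = 10y and p = (x - 100)/10 gives
#   10y = 1 + 2(x - 100)/10  =>  y = -1.9 + 0.02x
```

The substitution in the final comment is the "substitute the coding formulas back in" step: the coded equation and the decoded equation describe the same line, just in different units.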
Summary Takeaways
1. Explanatory variable (\(x\)) is the input; Response variable (\(y\)) is the output.
2. PMCC (\(r\)) measures the strength of a linear relationship from -1 to +1.
3. Correlation does not prove that one thing causes another.
4. The regression line \(y = a + bx\) always passes through the mean point \((\bar{x}, \bar{y})\).
5. Avoid extrapolation—predicting outside your data range is dangerous and unreliable!
Don't worry if the calculations for \(S_{xx}\) and \(S_{xy}\) look scary at first. Most of the time, the exam gives you these values, and you just need to plug them into the formulas for \(a\) and \(b\)!