Welcome to Correlation and Linear Regression!
Ever wondered if there is a real connection between the number of hours you spend on social media and your test scores? Or if taller people really do have bigger feet? In this chapter, we explore bivariate data—which is just a fancy way of saying we are looking at the relationship between two different variables. By the end of these notes, you’ll know how to spot patterns, measure how strong they are, and even predict the future (mathematically, of course)!
1. Scatter Diagrams: Seeing the Pattern
Before we touch any formulas, we always start by looking at the data. A scatter diagram is a graph where we plot pairs of data points \((x, y)\) on a coordinate plane.
Why do we use them?
We use scatter diagrams to judge if there is a plausible linear relationship. In simple terms: do the dots look like they are trying to form a straight line?
- Positive Linear Correlation: As \(x\) increases, \(y\) also tends to increase (the dots go "up-hill").
- Negative Linear Correlation: As \(x\) increases, \(y\) tends to decrease (the dots go "down-hill").
- Non-linear Relationship: The dots form a curve (like a "U" shape).
- No Correlation: The dots are scattered everywhere like spilled pepper; there’s no clear direction.
Analogy: Think of a scatter diagram like a "connect-the-dots" puzzle where the dots don't perfectly line up. Your job is to see if a straight ruler could roughly cover most of them.
Key Takeaway:
Always plot your scatter diagram first! It prevents you from trying to fit a straight line to data that is actually curved.
2. The Product Moment Correlation Coefficient (\(r\))
Now that we’ve seen the pattern, we need a way to measure it. This is where \(r\) comes in. It tells us two things: the strength and the direction of the linear relationship.
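For reference, here is the standard definition of \(r\) for a set of data pairs. The shorthand \(S_{xy}\), \(S_{xx}\) and \(S_{yy}\) is introduced here for convenience and is not notation used elsewhere in these notes. In practice, your GC computes \(r\) for you.

\[
r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}, \qquad
S_{xy} = \sum (x - \bar{x})(y - \bar{y}), \quad
S_{xx} = \sum (x - \bar{x})^2, \quad
S_{yy} = \sum (y - \bar{y})^2
\]

The sign of \(S_{xy}\) gives the direction ("up-hill" or "down-hill"), and dividing by \(\sqrt{S_{xx}\,S_{yy}}\) scales the result so that it always lands between \(-1\) and \(1\).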
The "r" Scale
The value of \(r\) always lies between \(-1\) and \(1\) inclusive.
- \(r = 1\): Perfect positive linear correlation (a perfect straight line up).
- \(r = -1\): Perfect negative linear correlation (a perfect straight line down).
- \(r = 0\): No linear correlation. (The variables could still be related by a curve, which is why we plot the scatter diagram first.)
- \(r \approx 0.9\): Very strong positive correlation.
- \(r \approx -0.3\): Weak negative correlation.
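If you want to check these values away from the GC, here is a minimal Python sketch that computes \(r\) by hand from the standard sums of deviations. The function name `pmcc` and all the data are made up for illustration. The last data set is a perfect "U" shape, included to show that \(r = 0\) only rules out a straight-line pattern.

```python
from math import sqrt

def pmcc(xs, ys):
    """Product moment correlation coefficient r for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

# Made-up data, for illustration only.
x = [1, 2, 3, 4, 5]
uphill = [2, 4, 6, 8, 10]     # exactly y = 2x
downhill = [10, 8, 6, 4, 2]   # exactly y = 12 - 2x

u_x = [-3, -2, -1, 0, 1, 2, 3]
u_y = [v ** 2 for v in u_x]   # a perfect curve, but not a line

r_up = pmcc(x, uphill)        # perfect positive: 1
r_down = pmcc(x, downhill)    # perfect negative: -1
r_curve = pmcc(u_x, u_y)      # 0, even though y depends entirely on x
print(r_up, r_down, r_curve)
```

Notice that the "U" data gives \(r = 0\) even though \(y\) is completely determined by \(x\). The coefficient only measures how well a straight line fits.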
Common Mistake to Avoid!
Correlation does NOT mean Causation. Just because two things are correlated (like ice cream sales and shark attacks), it doesn't mean one causes the other! (In that case, both are driven by a third factor: hot summer weather.)
Quick Review:
Close to 1 or -1: Strong linear relationship.
Close to 0: Weak or no linear relationship.
3. Linear Regression: Finding the Best-Fit Line
If the scatter diagram looks linear, we use the Method of Least Squares to find the equation of the "best" straight line. This line is called the Regression Line.
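In case you are curious what "least squares" actually means: for a line of the form \(y = a + bx\), the method chooses \(a\) and \(b\) so that the sum of the squared vertical distances from the data points to the line is as small as possible. The standard result is shown below; you will not normally compute it by hand, since the GC does it for you.

\[
\text{minimise } \sum \bigl(y - (a + bx)\bigr)^2
\quad\Longrightarrow\quad
b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}, \qquad
a = \bar{y} - b\,\bar{x}
\]

Because \(a = \bar{y} - b\,\bar{x}\), the line always passes through the mean point \((\bar{x}, \bar{y})\).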
Which line should you use?
In H2 Math, we usually work with two variables: the Independent Variable (\(x\)) and the Dependent Variable (\(y\)).
- The \(y\) on \(x\) line: Used when you want to predict \(y\) given a value of \(x\). This is the most common line you will use. Its form is \(y = a + bx\).
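- The \(x\) on \(y\) line: Used when you want to predict \(x\) given a value of \(y\) and neither variable is controlled. Its form is \(x = c + dy\). If \(x\) is a controlled independent variable (its values were fixed in advance), the usual advice is to stay with the \(y\) on \(x\) line even when estimating \(x\).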
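To see that the two lines really are different, here is a minimal Python sketch that fits both to the same made-up data using the standard least-squares formulas. The helper name `least_squares` is just a label chosen here, not a GC or library function.

```python
def least_squares(us, vs):
    """Return (intercept, slope) of the least-squares line v = intercept + slope * u."""
    n = len(us)
    u_bar = sum(us) / n
    v_bar = sum(vs) / n
    s_uv = sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs))
    s_uu = sum((u - u_bar) ** 2 for u in us)
    slope = s_uv / s_uu
    return v_bar - slope * u_bar, slope

# Made-up data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = least_squares(x, y)   # y on x line: y = a + b x
c, d = least_squares(y, x)   # x on y line: x = c + d y (roles swapped)

print(f"y on x: y = {a:.2f} + {b:.2f}x")
print(f"x on y: x = {c:.2f} + {d:.2f}y")
```

For this data the \(y\) on \(x\) line is \(y = 2.2 + 0.6x\), while the \(x\) on \(y\) line is \(x = -1 + y\). Both pass through the mean point \((3, 4)\), but they are not the same line. They only coincide when \(r = 1\) or \(r = -1\).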
Don't worry if this seems tricky: Your Graphing Calculator (GC) does the heavy lifting of calculating the values of \(a\) and \(b\) (or \(c\) and \(d\)) for you! Just make sure you input your data correctly into the lists.
4. Making Predictions: Interpolation vs. Extrapolation
Once you have your regression line equation, you can plug in numbers to make estimates. But be careful!
Interpolation (The Safe Zone)
This is predicting a value within the range of your original data.
Example: If your data is for students aged 13 to 18, predicting a result for a 15-year-old is interpolation. It is generally reliable.
Extrapolation (The Danger Zone)
This is predicting a value outside the range of your data.
Example: Predicting the height of a 40-year-old based on data from toddlers. It is unreliable because the linear trend might not continue outside the range of the data!
Key Takeaway:
Predictions are most reliable when the correlation is strong (\(r\) is close to 1 or -1) and when you are interpolating.
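The interpolation check is easy to express in code. Here is a minimal Python sketch, assuming you already have a \(y\) on \(x\) line \(y = a + bx\). The function name `predict_y` and the coefficients `a = 40.0`, `b = 2.0` are hypothetical, chosen only to show the idea.

```python
def predict_y(a, b, x_new, xs):
    """Estimate y from y = a + b x and report whether x_new lies inside the data range."""
    is_interpolation = min(xs) <= x_new <= max(xs)
    return a + b * x_new, is_interpolation

# Ages in the original data: students aged 13 to 18.
ages = [13, 14, 15, 16, 17, 18]
a, b = 40.0, 2.0   # hypothetical coefficients, not from real data

est_15, inside_15 = predict_y(a, b, 15, ages)   # 15 is within 13 to 18
est_40, inside_40 = predict_y(a, b, 40, ages)   # 40 is far outside the range

print(est_15, "interpolation" if inside_15 else "extrapolation")
print(est_40, "interpolation" if inside_40 else "extrapolation")
```

The formula happily produces a number in both cases. The range check is what tells you whether that number deserves any trust.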
5. Data Transformation: Dealing with Curves
What if the scatter diagram shows a curve? We can "straighten" it using transformations. The syllabus requires you to know how to use square, reciprocal, or logarithmic transformations.
How it works:
Instead of plotting \(y\) against \(x\), we might plot:
- \(y\) against \(x^2\)
- \(y\) against \(\frac{1}{x}\)
- \(y\) against \(\ln x\)
- \(\ln y\) against \(x\)
How to choose the best model?
When you try different transformations on your GC, the best model is the one where the absolute value of \(r\) is closest to 1. This means that specific transformation made the data look the most like a straight line.
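Here is a minimal Python sketch of that comparison. The data is made up and deliberately built to follow \(y = 3 + 2\ln x\) exactly, so the logarithmic model should come out on top. The names `pmcc` and `candidates` are labels chosen here, and only transformations of \(x\) are compared, to keep the example short.

```python
from math import log, sqrt

def pmcc(xs, ys):
    """Product moment correlation coefficient r for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

# Made-up data that follows y = 3 + 2 ln x exactly.
x = [1, 2, 4, 8, 16]
y = [3 + 2 * log(v) for v in x]

# Each candidate model plots y against a different transformed x.
candidates = {
    "x": x,
    "x^2": [v ** 2 for v in x],
    "1/x": [1 / v for v in x],
    "ln x": [log(v) for v in x],
}

r_values = {name: pmcc(t, y) for name, t in candidates.items()}
best = max(r_values, key=lambda name: abs(r_values[name]))

for name, r in r_values.items():
    print(f"y against {name}: r = {r:.4f}")
print("best model: y against", best)
```

Note the use of `abs`: a model with \(r\) close to \(-1\) (such as \(y\) against \(\frac{1}{x}\) for increasing data) is judged by how close \(|r|\) is to 1, not by the sign.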
Step-by-Step Hint:
1. Look at the scatter plot shape.
2. Apply the transformation suggested in the question.
3. Check the new \(r\) value.
4. Use the new equation (e.g., \(y = a + b(\ln x)\)) to make predictions.
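The steps above can be sketched in Python as follows, assuming the question asks for the model \(y = a + b\ln x\). The readings are made up, and `least_squares` is a helper name chosen here. The key point is the last step: the prediction uses \(\ln x\), not \(x\).

```python
from math import log

def least_squares(us, vs):
    """Return (intercept, slope) of the least-squares line v = intercept + slope * u."""
    n = len(us)
    u_bar = sum(us) / n
    v_bar = sum(vs) / n
    s_uv = sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs))
    s_uu = sum((u - u_bar) ** 2 for u in us)
    slope = s_uv / s_uu
    return v_bar - slope * u_bar, slope

# Made-up readings, for illustration only.
x = [1, 2, 4, 8, 16]
y = [3.1, 4.3, 5.8, 7.2, 8.5]

# Transform x first, then fit y against ln x.
ln_x = [log(v) for v in x]
a, b = least_squares(ln_x, y)    # model: y = a + b ln x

# Predict at x = 5, which is inside the data range 1 to 16.
x_new = 5
y_est = a + b * log(x_new)       # substitute ln 5, not 5

print(f"y = {a:.3f} + {b:.3f} ln x")
print(f"estimate at x = {x_new}: {y_est:.3f}")
```

Substituting `x_new` directly instead of `log(x_new)` would give a very different (and wrong) estimate, which is the mistake the checklist warns about.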
Summary Checklist for Exams
1. Scatter Diagram: Did I label my axes? Did I describe the relationship (linear/non-linear, positive/negative)?
2. Correlation Coefficient (\(r\)): Is it strong or weak? Does it support using a linear model?
3. Regression Line: Am I using the right line (\(y\) on \(x\) to predict \(y\))?
4. Reliability: Is the prediction an interpolation? Is \(|r|\) close to 1? (Always mention these two points!)
5. Transformations: Did I remember to plug my value into the transformed variable (like \(\ln x\) instead of just \(x\))?
You've got this! Correlation and Regression is one of the more "visual" chapters in H2 Math. Master your Graphing Calculator skills, and you'll find these questions very manageable.