Introduction to Linear Regression

Welcome to the chapter on Linear Regression! In your previous studies, you’ve likely drawn "lines of best fit" by eye on scatter diagrams. In Further Mathematics, we take this a step further: we will learn how to calculate, mathematically, the single line that best fits the relationship between two variables. This allows us to make sensible predictions and understand how one variable affects another.

Linear regression is vital in the real world—from doctors predicting a patient's health outcomes to businesses forecasting sales based on advertising spend. Don't worry if this seems a bit technical at first; we will break it down into simple, manageable steps!

1. Who's in Charge? Independent vs. Dependent Variables

Before we can calculate a line, we need to know which variable is which. In any experiment or observation involving two variables (\(x\) and \(y\)), we usually have:

  • Independent (or Controlled) Variable (\(x\)): This is the variable that we either control or is the "input." For example, the amount of time you spend studying.
  • Dependent (or Response) Variable (\(y\)): This is the variable we measure to see how it reacts. For example, your test score depends on how much you studied.

Important Note: Sometimes, neither variable is strictly "controlled." If a scientist measures the height and arm span of various people, neither variable truly "controls" the other. However, for the purpose of regression, we usually choose one to be the predictor (\(x\)) and the other to be the result (\(y\)).

Quick Review: The Axis Rule

In a scatter diagram, always put the Independent variable on the horizontal x-axis and the Dependent variable on the vertical y-axis.

Key Takeaway: Identifying which variable is independent (\(x\)) and which is dependent (\(y\)) is the first step in any regression problem.

2. The "Least Squares" Concept

Why do we call the line we calculate the Least Squares Regression Line? Imagine you draw a line through a cluster of points. Some points will be above the line, and some will be below. The vertical distance between a point and the line is called a residual.

To find the absolute "best" line, we want these residuals to be as small as possible. However, some distances are positive and some are negative, so if we just added them up, they would cancel each other out. To fix this, we square the distances (making them all positive) and then find the line that makes the sum of these squares as small as possible. This is why it’s called "Least Squares."

Analogy: Imagine each data point is attached to a rigid rod (the line) by a vertical spring. A spring stores energy proportional to the square of its stretch, so the rod settles in the position where the total squared stretch — the sum of the squared residuals — is as small as possible. That resting position is the "Least Squares" line.
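We can see this idea numerically. The sketch below uses a small invented data set (the values are made up purely for illustration): it computes the sum of squared residuals for two candidate lines, and the line with the smaller total is the better fit.

```python
# Invented data, purely for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

def sum_sq_residuals(a, b):
    """Total of the squared vertical distances from each point to y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Comparing two candidate lines: the smaller total wins.
print(sum_sq_residuals(0, 2))    # line y = 2x  (a close fit to this data)
print(sum_sq_residuals(1, 1.5))  # line y = 1 + 1.5x  (a noticeably worse fit)
```

The least squares regression line is simply the choice of \(a\) and \(b\) that makes this total smaller than for any other line.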

3. Calculating the Regression Line of \(y\) on \(x\)

The equation of the regression line looks just like the equation of a straight line you learned at GCSE: \( y = a + bx \)

  • \(b\) is the gradient (how steep the line is).
  • \(a\) is the y-intercept (where the line crosses the y-axis).

Step-by-Step Calculation

You can calculate these values using summarised data (statistics like the mean and sums of squares) or raw data using your calculator's statistics mode.

Step 1: Calculate the Gradient (\(b\))

First, find \(b\) using the formula: \( b = \frac{S_{xy}}{S_{xx}} \)

Where \(S_{xx}\) and \(S_{xy}\) are the summary statistics you practised in the Correlation chapter: \(S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}\) (a sum of squares) and \(S_{xy} = \sum xy - \frac{\sum x \sum y}{n}\) (a sum of products). Note: A positive \(b\) means a positive correlation; a negative \(b\) means a negative correlation.

Step 2: Calculate the Intercept (\(a\))

Once you have \(b\), use the means of \(x\) and \(y\) (\(\bar{x}\) and \(\bar{y}\)) to find \(a\): \( a = \bar{y} - b\bar{x} \)

Did you know? The regression line always passes through the point of the means \((\bar{x}, \bar{y})\). This is a great way to check if your line is positioned correctly on a graph!
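The two steps can be followed through on a small data set. This is a sketch with invented numbers, not an example from the chapter, but the formulas are exactly the ones above:

```python
# Invented data for illustration.
xs = [2, 4, 6, 8, 10]
ys = [3, 7, 8, 12, 15]

n = len(xs)
x_bar = sum(xs) / n    # mean of x
y_bar = sum(ys) / n    # mean of y

# Summary statistics from the Correlation chapter.
s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n

b = s_xy / s_xx          # Step 1: the gradient
a = y_bar - b * x_bar    # Step 2: the intercept

# Check: the regression line always passes through the mean point.
assert abs((a + b * x_bar) - y_bar) < 1e-9

print(f"y = {a:.2f} + {b:.2f}x")
```

For this data, \(S_{xx} = 40\) and \(S_{xy} = 58\), giving \(b = 1.45\) and \(a = 0.3\), so the line is \(y = 0.3 + 1.45x\) — and you can verify it passes through the mean point \((6, 9)\).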

Common Mistake to Avoid:

Students often try to calculate the "regression line of \(x\) on \(y\)." In this syllabus, if \(x\) is the independent variable, you only calculate the line of \(y\) on \(x\). Don't swap them!

Key Takeaway: Find \(b\) first, then use it to find \(a\). The final equation should always be written in the form \(y = a + bx\).

4. The Effect of Linear Coding

Sometimes, data is "coded" to make it easier to handle (e.g., subtracting 1000 from every value or dividing by 10). This is called Linear Coding.

If you change the units of your data (like changing metres to centimetres), the regression line will change too. If you apply a code like \(x_{new} = \frac{x - 10}{2}\), the gradient and intercept of your regression line will shift accordingly.

Simple Trick: If you are given a regression line for coded data and need to find the original line, simply substitute the coding formulas back into the \(y = a + bx\) equation and rearrange it!
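As a concrete check of that trick, suppose (hypothetically) the coded regression line is \(y = 3 + 2u\), where \(u = \frac{x - 10}{2}\). Substituting the code back in gives \(y = 3 + 2 \cdot \frac{x - 10}{2} = x - 7\). A quick numerical sketch confirms the two forms agree:

```python
def coded_line(x):
    """Hypothetical coded regression line y = 3 + 2u, with u = (x - 10) / 2."""
    u = (x - 10) / 2
    return 3 + 2 * u

# Rearranged back into original units, the same line is y = x - 7.
for x in [0, 10, 25]:
    assert coded_line(x) == x - 7
```

If the substitution and your rearrangement disagree at even one value of \(x\), you have made an algebra slip — so this is a handy way to check your working.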

5. Using the Line for Predictions

The whole point of finding the equation \(y = a + bx\) is to estimate values of \(y\) for a given \(x\). This is like having a mathematical crystal ball!

Interpolation vs. Extrapolation

  • Interpolation: Making a prediction within the range of the original data. Example: If your data covers ages 5 to 15, predicting for age 10 is interpolation. This is generally reliable.
  • Extrapolation: Making a prediction outside the range of the original data. Example: Predicting for age 50 using that same data. This is unreliable and risky because the relationship might change outside the range we observed.
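One way to make this distinction concrete is a small predictor that flags extrapolation whenever the requested \(x\) falls outside the observed range. The line coefficients and the data range below are hypothetical, chosen to match the ages 5 to 15 example:

```python
# Hypothetical regression line and observed x-range (ages 5 to 15).
a, b = 2.0, 1.5
x_min, x_max = 5, 15

def predict(x):
    """Estimate y = a + b*x, noting whether the estimate is trustworthy."""
    kind = "interpolation" if x_min <= x <= x_max else "extrapolation (unreliable)"
    return a + b * x, kind

print(predict(10))  # inside the data range -> interpolation
print(predict(50))  # far outside the range -> treat with great caution
```

An examiner will expect the same judgement from you in words: state the estimate, then comment on whether the \(x\) value lies inside the original data range.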

Understanding Uncertainty

Even a "best-fit" line isn't perfect. When you make an estimate, you should always interpret it in context:

  • If the correlation is very strong (points are close to the line), your estimate is more likely to be accurate.
  • If the correlation is weak, or if you are extrapolating, your estimate has a high level of uncertainty.

Quick Review: Prediction Reliability

Reliable = Strong correlation + Interpolation
Unreliable = Weak correlation OR Extrapolation

Key Takeaway: Use your line to estimate \(y\), but always check if your \(x\) value is within the original data range before trusting the result!

Final Summary Checklist

  • Have I correctly identified the independent (\(x\)) and dependent (\(y\)) variables?
  • Did I calculate \(b\) (the gradient) before \(a\) (the intercept)?
  • Does my line pass through the mean point \((\bar{x}, \bar{y})\)?
  • Is my prediction interpolation (inside the range) or extrapolation (outside)?
  • If the data is coded, have I converted my final answer back to the original units?