Welcome to Linear Regression!

In your A Level Maths, you’ve likely seen "lines of best fit" drawn by eye. In Further Mathematics, we take this a step further. We use Linear Regression to calculate the mathematically perfect line of best fit. This allows us to predict values and understand the relationship between two variables with much more precision. Whether you're predicting future profits or the results of a scientific experiment, regression is the tool you need.

Don't worry if this seems tricky at first! We will break it down step-by-step, from identifying your variables to using your calculator effectively.


1. Independent vs. Dependent Variables

Before we can calculate anything, we need to know which variable is which. In Statistics, we usually look at how one thing affects another.

  • Independent Variable (x): Also called the explanatory or controlled variable. This is the one that is set or measured first. For example, the time spent revising.
  • Dependent Variable (y): Also called the response variable. This is what we measure to see how it changed. For example, the test score.

Real-World Example: If you are investigating how the amount of fertiliser affects plant growth, the amount of fertiliser is the independent variable (\(x\)) because you decide how much to give. The plant height is the dependent variable (\(y\)) because it "depends" on the fertiliser.

Did you know? Sometimes, neither variable is strictly "controlled." For example, if you measure the arm length and leg length of athletes, neither one "causes" the other, but we still assign one to \(x\) and one to \(y\) to find a relationship.

Quick Review: Always plot the independent variable on the horizontal axis (x) and the dependent variable on the vertical axis (y).


2. The "Least Squares" Concept

How do we decide which line is truly the "best"? We use the Least Squares method.

Imagine a scatter diagram. For any line we draw, there will be a vertical distance between each data point and the line. This distance is called a residual. Some points are above the line (positive residual), and some are below (negative residual).

To find the best line, we:

  1. Square all these residuals (so that negative residuals can't cancel out positive ones).
  2. Add them all up.
  3. Find the line that makes this sum of squares as small as possible.

Analogy: Imagine each data point is connected to a metal rod (the line) by a spring. The rod will naturally settle in the position where the total tension in all the springs is at its lowest. That's your Least Squares Regression Line!
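You won't be asked to code this in the exam, but the idea above can be sketched in a few lines of Python. The data here is hypothetical (it lies exactly on \(y = 1 + 2x\)), just to show that the "true" line makes the sum of squared residuals as small as possible:

```python
# Sum of squared residuals for a candidate line y = a + b*x.
# Hypothetical data: these points lie exactly on y = 1 + 2x.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

def sum_of_squares(a, b):
    # residual = observed y minus the y predicted by the line
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

print(sum_of_squares(1, 2))  # the true line: sum of squares is 0
print(sum_of_squares(0, 2))  # a worse line gives a larger sum
```

Any other choice of \(a\) and \(b\) gives a bigger total, which is exactly what "least squares" means.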


3. The Equation of the Regression Line

For the OCR syllabus, the equation of the regression line of y on x is written as:

\(y = a + bx\)

Where:

  • b is the gradient (how much \(y\) changes for every 1 unit increase in \(x\) — this can be negative).
  • a is the y-intercept (the value of \(y\) when \(x = 0\)).

How to calculate b and a:

You will often be given summary statistics like \(\sum x\), \(\sum y\), \(\sum x^2\), and \(\sum xy\). Use these formulas:

1. Calculate \(S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}\)

2. Calculate \(S_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n}\)

3. Find the gradient: \(b = \frac{S_{xy}}{S_{xx}}\)

4. Find the intercept: \(a = \bar{y} - b\bar{x}\)

(Note: \(\bar{x}\) and \(\bar{y}\) are the means of \(x\) and \(y\))
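The four steps above can be sketched in Python (the data values here are hypothetical; in the exam your calculator's regression mode does the same job):

```python
# Least-squares line y = a + bx from summary statistics, following steps 1-4.
# Hypothetical data: x = hours revised, y = test score.
xs = [1, 2, 3, 4, 5]
ys = [2, 5, 6, 9, 13]
n = len(xs)

sum_x, sum_y = sum(xs), sum(ys)
sum_x2 = sum(x * x for x in xs)
sum_xy = sum(x * y for x, y in zip(xs, ys))

S_xx = sum_x2 - sum_x ** 2 / n        # step 1
S_xy = sum_xy - sum_x * sum_y / n     # step 2
b = S_xy / S_xx                       # step 3: gradient
a = sum_y / n - b * sum_x / n         # step 4: intercept, a = y-bar - b * x-bar
print(f"y = {a} + {b}x")
```

For this data \(S_{xx} = 10\), \(S_{xy} = 26\), giving \(b = 2.6\) and \(a = -0.8\), so the line is \(y = -0.8 + 2.6x\).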

Common Mistake to Avoid: Don't mix up \(a\) and \(b\)! In pure maths, you usually use \(y = mx + c\). In Statistics, we often use \(y = a + bx\). Make sure you read the calculator output carefully!

Key Takeaway: The regression line always passes through the "mean point" \((\bar{x}, \bar{y})\).


4. Linear Coding

Sometimes the data values are huge (like 1,000,005) or tiny (0.00002). To make it easier, we "code" the data using a linear transformation like \(u = \frac{x - c}{d}\).

If you calculate a regression line for the coded data (e.g., \(v = a' + b'u\)), you can find the original regression line by substituting the coding formulas back into the equation.

Memory Aid: Coding is just like changing the "scale" or "units" of your graph. The relationship between the variables stays the same, just the numbers look different!
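Here is a small sketch of that substitution in action. The data and the coding constants (\(c = 1000\), \(d = 5\)) are made up for illustration; the point is that regressing on the coded data and then substituting back recovers exactly the same line as regressing on the original data:

```python
# Coding sketch: u = (x - 1000) / 5 shrinks awkward x-values.
# (Hypothetical data; coding constants chosen purely for illustration.)
xs = [1005, 1010, 1015, 1020]
ys = [4, 7, 9, 14]
us = [(x - 1000) / 5 for x in xs]   # coded values: 1.0, 2.0, 3.0, 4.0

def regress(xvals, yvals):
    # least-squares line y = a + b*x via the S_xx / S_xy formulas
    n = len(xvals)
    S_xx = sum(x * x for x in xvals) - sum(xvals) ** 2 / n
    S_xy = sum(x * y for x, y in zip(xvals, yvals)) - sum(xvals) * sum(yvals) / n
    b = S_xy / S_xx
    a = sum(yvals) / n - b * sum(xvals) / n
    return a, b

a_c, b_c = regress(us, ys)           # coded line: y = a' + b'u
# Substitute u = (x - 1000)/5 back:  y = (a' - 1000*b'/5) + (b'/5)x
a_orig = a_c - b_c * 1000 / 5
b_orig = b_c / 5
print(f"coded:    y = {a_c} + {b_c}u")
print(f"original: y = {a_orig} + {b_orig}x")
```

Running `regress(xs, ys)` directly on the original data gives the same \(a\) and \(b\) as the substitution, confirming that coding never changes the underlying relationship.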


5. Using the Line for Estimation

The main reason we find this equation is to make predictions. If we have a value for \(x\), we can plug it into \(y = a + bx\) to estimate \(y\).

Interpolation vs. Extrapolation

  • Interpolation: Making a prediction within the range of your original data. This is generally reliable, provided the correlation is strong.
  • Extrapolation: Making a prediction outside the range of your data. This is dangerous and unreliable because we don't know if the linear trend continues forever.

Example: If you measure a child's height from age 2 to 10, you can safely predict their height at age 5 (interpolation). However, using that same line to predict their height at age 40 (extrapolation) would suggest they will be 10 feet tall!

Key Point: When asked to comment on the reliability of an estimate, check if it's interpolation or extrapolation, and check how strong the correlation is.
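A quick way to keep yourself honest is to check whether the \(x\)-value you're plugging in lies inside the data range. This sketch uses hypothetical fitted coefficients (\(a = -0.8\), \(b = 2.6\)) and an assumed data range of \(1 \le x \le 5\):

```python
# Estimation sketch using a fitted line y = a + b*x.
a, b = -0.8, 2.6      # hypothetical fitted coefficients
x_min, x_max = 1, 5   # range of the original x-data (assumed)

def estimate(x):
    y = a + b * x
    kind = "interpolation" if x_min <= x <= x_max else "extrapolation (unreliable!)"
    return y, kind

print(estimate(3))    # within the data range: safe to use
print(estimate(40))   # far outside the data: treat with suspicion
```

The calculation itself is identical in both cases; only the trustworthiness of the answer changes.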


Summary Checklist

1. Identify your independent (\(x\)) and dependent (\(y\)) variables.
2. Calculate summary statistics (\(S_{xx}\) and \(S_{xy}\)) or use your calculator's 2-Var statistics/regression mode.
3. Form the equation \(y = a + bx\).
4. Interpret \(a\) and \(b\) in the context of the question (e.g., "The starting temperature was \(a\), and it increased by \(b\) degrees per minute").
5. Estimate values, but be wary of extrapolation!

Keep practising! Regression is one of the most useful parts of Statistics because it's used in almost every industry to plan for the future. You've got this!