Introduction to Linear Regression

Welcome to the first chapter of Further Statistics 2! Don't worry if you found statistics a bit dry in the past; Linear Regression is where we actually start using data to make predictions about the real world.

In simple terms, linear regression is the art of finding the "best-fitting" straight line through a set of data points. Imagine you are trying to predict how much a house will cost based on its size. You have a bunch of data points, and you want to draw a line that gets as close to all of them as possible. That’s linear regression! In this chapter, we will learn exactly how to calculate that line and, more importantly, how to check if that line is actually any good.

Did you know? The term "regression" was coined by Francis Galton in the 19th century when he noticed that children of very tall parents tended to be shorter than their parents—"regressing" toward the average height!


1. The Least Squares Regression Line

In your standard A Level Maths, you learned about the line of best fit. In Further Mathematics, we use a specific method called Least Squares to pin down exactly which line is "best". Here we deal with the regression of \(y\) on \(x\), which we use when we want to predict a dependent variable (\(y\)) from an independent variable (\(x\)).

What is a "Least Squares" Line?

Imagine your data points are like fireflies hovering in a jar. You want to slide a thin glass plate (the regression line) into the jar so that it sits right in the middle of them.

For any point, there is a vertical distance between the actual data point and the line. This distance is called a residual. Some points are above the line (positive residual), and some are below (negative residual). To find the "best" line, we square all these distances (to make them all positive) and then add them up. The "Least Squares" line is the one that makes this sum of squares as small as possible.

The Regression Equation

The equation of the regression line of \(y\) on \(x\) is written as:

\(y = a + bx\)

To find the coefficients \(a\) and \(b\), we use these standard formulae:

  1. Calculate the gradient (\(b\)): \(b = \frac{S_{xy}}{S_{xx}}\)
  2. Calculate the intercept (\(a\)): \(a = \bar{y} - b\bar{x}\)

Quick Review: Remember from earlier statistics that:
\(S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}\)
\(S_{yy} = \sum y^2 - \frac{(\sum y)^2}{n}\)
\(S_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n}\)
\(\bar{x}\) and \(\bar{y}\) are the means of the \(x\) and \(y\) values. (You won't need \(S_{yy}\) for the line itself, but it appears later in this chapter.)

Step-by-Step Guide:
1. Calculate the summary statistics (\(\sum x, \sum y, \sum x^2, \sum xy, n\)).
2. Find \(S_{xx}\) and \(S_{xy}\).
3. Find \(b\) first (you need it for the next step!).
4. Find the means \(\bar{x}\) and \(\bar{y}\).
5. Plug them into the formula for \(a\).
6. Write out your final equation: \(y = a + bx\).
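As a sketch, the six steps can be run through in Python on some invented house-size data (the sizes and prices below are made up purely for illustration):

```python
# Least squares by hand, following the step-by-step guide.
# x: house size, y: price -- both invented for illustration.
x = [10, 15, 20, 25, 30]
y = [150, 200, 230, 290, 310]

# Step 1: summary statistics
n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_x2 = sum(xi**2 for xi in x)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))

# Step 2: S_xx and S_xy
S_xx = sum_x2 - sum_x**2 / n
S_xy = sum_xy - sum_x * sum_y / n

# Step 3: gradient first (step 5 needs it)
b = S_xy / S_xx

# Steps 4-5: means, then the intercept
x_bar, y_bar = sum_x / n, sum_y / n
a = y_bar - b * x_bar

# Step 6: the final equation
print(f"y = {a:.2f} + {b:.2f}x")  # y = 72.00 + 8.20x
```

In an exam you would do this on a calculator, of course, but seeing each step as a line of code makes the order of operations explicit.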

Common Mistake to Avoid: Always calculate \(b\) before \(a\). If you try to do it the other way around, you'll get stuck because the formula for \(a\) requires the value of \(b\)!

Key Takeaway: The least squares regression line minimizes the sum of the squares of the vertical distances (residuals) from the points to the line.


2. Understanding Residuals

As we mentioned, a residual is simply the "leftover" or the error for a specific data point. It’s the difference between what actually happened and what our line predicted would happen.

The Residual Formula

For any specific point \((x_i, y_i)\), the residual \(e_i\) is calculated as:

\(e_i = y_i - (a + bx_i)\)

In plain English: Residual = Observed Value - Predicted Value.

  • If the residual is positive, the actual data point is above the line (the model underestimated).
  • If the residual is negative, the actual data point is below the line (the model overestimated).
  • If the residual is zero, the point lies exactly on the line.
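The sign rules above can be checked with a short sketch (the fitted values \(a = 72\), \(b = 8.2\) and the data points are invented for illustration):

```python
# Residuals for a fitted line y = a + bx.
# a, b and the data points are invented for illustration.
a, b = 72.0, 8.2
data = [(10, 150), (15, 200), (20, 230), (25, 290), (30, 310)]

residuals = []
for x_i, y_i in data:
    predicted = a + b * x_i          # what the line says should happen
    e_i = y_i - predicted            # observed minus predicted
    residuals.append(e_i)
    position = "above" if e_i > 0 else "below" if e_i < 0 else "on"
    print(f"x = {x_i}: residual {e_i:+.1f} (point {position} the line)")

# A useful check: for a least squares fit, the residuals sum to zero.
print(sum(residuals))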

Why do we care about residuals?

Residuals are like a "health check" for your mathematical model. By looking at them, we can see if our straight line is actually a good fit for the data.

Memory Aid: Think of the regression line as a diet plan and the data points as your actual weight. The residuals are the difference between what the diet plan said you'd weigh and what the scale actually shows. If the residuals are huge, you might need a better plan!

Key Takeaway: Residuals tell us how far off our predictions are for each individual data point.


3. Residual Sum of Squares (RSS)

While an individual residual tells us about one point, the Residual Sum of Squares (RSS) tells us how well the line fits the entire set of data.

The RSS is the value that the "Least Squares" method works so hard to minimize. A smaller RSS means the line is a better fit for the data.

The RSS Formula

In your exam, you don't need to derive this, but you must be able to calculate it using this standard formula:

\(RSS = S_{yy} - \frac{(S_{xy})^2}{S_{xx}}\)

Helpful Tip: You might notice that \(\frac{(S_{xy})^2}{S_{xx}}\) is actually the same as \(b \times S_{xy}\). So, you can also think of it as:
\(RSS = S_{yy} - bS_{xy}\)
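A quick sketch (using invented data) confirms that the summary-statistic formula gives the same answer as squaring and summing the residuals directly:

```python
# RSS two ways: via summary statistics, and directly from the residuals.
# Data invented for illustration.
x = [10, 15, 20, 25, 30]
y = [150, 200, 230, 290, 310]
n = len(x)

S_xx = sum(xi**2 for xi in x) - sum(x)**2 / n
S_yy = sum(yi**2 for yi in y) - sum(y)**2 / n
S_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n

b = S_xy / S_xx
a = sum(y) / n - b * sum(x) / n

rss_formula = S_yy - S_xy**2 / S_xx                              # exam formula
rss_direct = sum((yi - (a + b * xi))**2 for xi, yi in zip(x, y)) # definition

print(rss_formula, rss_direct)  # both give the same value
```

The exam formula is much faster by hand, but the direct version shows what the number actually means.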

Quick Review Box:
Low RSS: Points are very close to the line. Predictions are likely accurate.
High RSS: Points are scattered far from the line. Predictions might be unreliable.

Key Takeaway: RSS is a single number that summarizes the total "error" of your regression line. The smaller, the better!


4. Evaluating the Model and Refinement

Just because we can calculate a regression line doesn't mean we should. We use residuals to check if our linear model is reasonable.

Checking for "Reasonableness"

If you plot your residuals on a graph (a residual plot), you are looking for random scatter.

  1. Good Fit: The residuals are randomly scattered above and below the zero line with no obvious shape. This means a linear model is appropriate.
  2. Bad Fit (Non-linear): If the residuals form a "U" shape or a curve, it suggests that the real relationship isn't a straight line—it might be a quadratic or exponential curve instead.
  3. Outliers: A point with a very large residual compared to the others is a potential outlier. It's a point where the model failed significantly.
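The "U shape" warning sign in point 2 is easy to demonstrate: if you fit a straight line to data that is genuinely quadratic, the residuals come out positive at the ends and negative in the middle. The data below is invented for this purpose:

```python
# Fitting a straight line to deliberately curved data and inspecting
# the residuals -- an invented illustration of the "U shape" warning sign.
x = [1, 2, 3, 4, 5, 6, 7]
y = [xi**2 for xi in x]          # the true relationship is quadratic

n = len(x)
S_xx = sum(xi**2 for xi in x) - sum(x)**2 / n
S_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
b = S_xy / S_xx
a = sum(y) / n - b * sum(x) / n

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
print(residuals)  # [5.0, 0.0, -3.0, -4.0, -3.0, 0.0, 5.0]
```

Positive at both ends, negative in the middle: a clear U shape, telling us the straight-line model is the wrong choice here.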

Refining the Model

If we find that our model isn't great, we can refine it by:

  • Removing Outliers: If a specific data point was a mistake (e.g., a typing error in a lab report), removing it will change \(a\) and \(b\) and likely decrease the RSS.
  • Changing the Model: If the residuals show a pattern, we might need to transform the data (which you'll see in other chapters) rather than using a simple linear regression.
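The first refinement can be illustrated with a short sketch (every number below is invented): removing a point that looks like a recording error changes the fitted line and sharply reduces the RSS.

```python
# Invented illustration: removing a suspect point lowers the RSS.
def fit_and_rss(x, y):
    """Return the residual sum of squares for the least squares line."""
    n = len(x)
    S_xx = sum(xi**2 for xi in x) - sum(x)**2 / n
    S_yy = sum(yi**2 for yi in y) - sum(y)**2 / n
    S_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
    return S_yy - S_xy**2 / S_xx

x = [1, 2, 3, 4, 5]
y = [2.1, 4.0, 6.2, 30.0, 10.1]    # the 30.0 looks like a typing error

rss_with = fit_and_rss(x, y)
rss_without = fit_and_rss(x[:3] + x[4:], y[:3] + y[4:])
print(rss_with, rss_without)        # RSS drops sharply once the outlier goes
```

Of course, a point should only be removed when there is a genuine reason to believe it is an error, not simply because it makes the fit look better.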

Summary of Model Evaluation:
- Use residuals to find outliers.
- Use residual plots to check if a straight line is the right choice.
- Use RSS to compare different models (the one with the lower RSS is generally better).

Key Takeaway: Always look at your residuals! They tell the story that the regression equation alone cannot tell.


Congratulations! You've covered the core concepts of Linear Regression for Further Statistics 2. Keep practicing those \(S_{xx}\) and \(S_{xy}\) calculations, and you'll be a pro in no time!