Welcome to Bivariate Data!

In this chapter, we are going to explore the relationship between two different variables. Think of it like a mathematical matchmaking service—we want to see if changes in one thing (like how much you study) are linked to changes in another (like your exam score). By the end of these notes, you’ll know how to visualize these relationships, measure their strength, and even predict future values. Don't worry if it seems like a lot of jargon at first; we'll take it one step at a time!

1. The Two Types of Bivariate Data

Before we start drawing graphs, we need to understand what we are measuring. In the world of Further Maths, we split bivariate data into two "Cases":

Case A: Random on Non-Random

This happens when an experimenter controls one variable (the independent variable, usually \(x\)) and measures the other (the dependent variable, \(y\)).
Example: You decide to test a spring by hanging specific weights (2 kg, 4 kg, 6 kg) and measuring how much it stretches. You chose the weights, so they aren't "random," but the extension of the spring is.

Case B: Random on Random

This happens when both variables occur naturally and we just observe them. Neither is controlled.
Example: You measure the height and weight of 50 random students. You didn't "choose" a student to be exactly 170 cm tall; both height and weight are random variables. This usually looks like a "data cloud" on a graph.

Quick Review:
Case A: One is controlled (like a scientist in a lab).
Case B: Both are random (like observing nature).

Key Takeaway: Identifying the case is important because it changes how we interpret our results later on!

2. Scatter Diagrams

A Scatter Diagram is our first port of call. It’s a visual representation where each pair of data is a point on a grid.

How to set it up:

1. The Independent variable (or the one you control) goes on the horizontal \(x\)-axis.
2. The Dependent variable (the one you measure) goes on the vertical \(y\)-axis.
3. Look for outliers: These are points that "don't fit" the general pattern. They might be errors or very unusual cases.

Did you know? Software often draws a "trendline" for you. It might also report a value called \(r^2\) (the coefficient of determination). This tells you the proportion of the variation in \(y\) that is explained by the linear relationship with \(x\).

Key Takeaway: Always look at the scatter diagram first. If the points look like a random mess, a linear model might not be the right choice!
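To see where that \(r^2\) value comes from, here is a minimal sketch in Python (the data pairs are purely illustrative, not from a real study). It fits a trendline by least squares and computes \(r^2\) as the proportion of the variation in \(y\) that the line explains:

```python
import numpy as np

# Illustrative data (hypothetical study-hours vs exam-score pairs)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Fit a trendline y = a + b*x by least squares
b, a = np.polyfit(x, y, 1)              # polyfit returns highest power first

# r^2 = 1 - (unexplained variation) / (total variation)
ss_res = np.sum((y - (a + b * x)) ** 2)   # variation left over after the line
ss_tot = np.sum((y - y.mean()) ** 2)      # total variation in y
r_squared = 1 - ss_res / ss_tot

print(round(r_squared, 2))  # 0.6
```

Here \(r^2 = 0.6\) would mean the trendline explains 60% of the variation in \(y\); the remaining 40% is scatter around the line.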

3. Measuring Correlation: The PMCC (\(r\))

Pearson’s Product Moment Correlation Coefficient (PMCC, denoted by \(r\)) is a number that tells us how close the points are to a perfectly straight line.

Important Rules for \(r\):

Range: The value of \(r\) is always between \(-1\) and \(1\).
\(r = 1\): Perfect positive linear correlation (upward slope).
\(r = -1\): Perfect negative linear correlation (downward slope).
\(r = 0\): No linear correlation at all.

When can we use it?

You can only perform a formal hypothesis test with \(r\) if the data is Random on Random (Case B) and follows a Bivariate Normal Distribution. On a scatter diagram, this looks like an elliptical (egg-shaped) cloud of points. If the data looks like a curve, or if it is skewed, the PMCC might be misleading!

Memory Aid: Think of \(r\) as "Reliability of the line." If \(r\) is near 1 or -1, the line is very reliable.

Key Takeaway: PMCC measures linear relationships. It doesn't work well for curves!
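The PMCC can be computed from three summary sums, \(r = S_{xy}/\sqrt{S_{xx}S_{yy}}\). A small sketch (same kind of illustrative data as above, not real measurements) showing the hand calculation agreeing with the library value:

```python
import numpy as np

# Illustrative data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Summary sums used in the PMCC formula: r = Sxy / sqrt(Sxx * Syy)
sxy = np.sum((x - x.mean()) * (y - y.mean()))
sxx = np.sum((x - x.mean()) ** 2)
syy = np.sum((y - y.mean()) ** 2)
r = sxy / np.sqrt(sxx * syy)

# The hand formula agrees with numpy's built-in correlation matrix
assert np.isclose(r, np.corrcoef(x, y)[0, 1])
print(round(r, 4))  # 0.7746
```

Note that \(r \approx 0.77\) squared gives the \(r^2 \approx 0.6\) a software trendline would report for the same data.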

4. Spearman’s Rank Correlation Coefficient (\(r_s\))

Sometimes data isn't in a perfect line, or it's hard to measure exactly (like "rankings" in a talent show). This is where Spearman’s Rank comes in.

Why use it?

• It tests for an association (a general trend) rather than just a straight line.
• It works for non-linear data, as long as it is monotonic (always increasing or always decreasing).
• It requires no assumptions about the "Normal" distribution of data. It’s very robust!

The Process:

1. Rank the data for both variables (1st, 2nd, 3rd...).
2. Calculate the PMCC of those ranks (your calculator can do this!).

Common Mistake: Don't forget that when you rank data, you "lose" some information about the actual distances between values. Only use Spearman's if the data isn't suitable for PMCC.
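The two-step process above can be sketched directly: rank each variable, then take the PMCC of the ranks. The data below is illustrative, chosen to be monotonic but non-linear (\(y = x^2\)), so Spearman's gives a perfect \(r_s = 1\) even though the raw PMCC is below 1:

```python
import numpy as np

# Monotonic but non-linear illustrative data: y = x^2 preserves the ordering
x = np.array([2, 5, 1, 4, 3])
y = x ** 2                      # [4, 25, 1, 16, 9]

def ranks(a):
    """Assign ranks 1..n to the values (no ties in this illustration)."""
    return a.argsort().argsort() + 1

# Step 1: rank both variables. Step 2: PMCC of the ranks = Spearman's r_s.
r_s = np.corrcoef(ranks(x), ranks(y))[0, 1]
r = np.corrcoef(x, y)[0, 1]     # ordinary PMCC on the raw (curved) values

print(round(r_s, 4))  # 1.0 — perfect monotonic association
```

Because the ordering of \(y\) exactly matches the ordering of \(x\), the ranks are identical and \(r_s = 1\), while the raw PMCC is dragged below 1 by the curvature.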

Key Takeaway: Use \(r\) for straight lines and "Normal" data; use \(r_s\) for curves or ranked data.

5. Hypothesis Testing for Correlation

We use hypothesis tests to see if the correlation we found in our sample is likely to exist in the whole population, or if it was just a fluke.

The Setup:

Null Hypothesis (\(H_0\)): There is no correlation in the population (the population correlation \(\rho = 0\)).
Alternative Hypothesis (\(H_1\)): There is a correlation (\(\rho \neq 0\), \(\rho > 0\), or \(\rho < 0\)).

The Decision:

You will compare your calculated \(r\) or \(r_s\) value against a critical value from a table (or use a p-value from software).
• If your value is more extreme than the critical value, reject \(H_0\).
• Always conclude in context: "There is sufficient evidence to suggest a positive correlation between temperature and ice cream sales."
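The decision step can be sketched as a simple comparison. Note the critical value below is purely illustrative, invented for this sketch; in an exam you must read the correct value for your sample size \(n\) and significance level from the tables you are given:

```python
# A minimal sketch of the decision rule for a one-tailed test of H1: rho > 0.
# (A two-tailed test of H1: rho != 0 would compare |r| against its own
# two-tailed critical value instead.)

def decide(r_sample, r_critical):
    """Reject H0 if the sample correlation exceeds the critical value."""
    return "reject H0" if r_sample > r_critical else "do not reject H0"

R_CRITICAL = 0.5494  # illustrative placeholder, NOT a value to memorise

print(decide(0.72, R_CRITICAL))  # reject H0
print(decide(0.31, R_CRITICAL))  # do not reject H0
```

Remember that the printed decision is only half the answer: the conclusion must then be phrased in the context of the variables, as in the ice cream example above.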

Key Takeaway: A hypothesis test doesn't prove one thing causes another; it only provides evidence that the two variables are associated. Correlation is not causation!

6. Regression Lines (The Line of Best Fit)

A regression line is a mathematical equation in the form \(y = a + bx\) that helps us predict values.

Least Squares Regression

This method finds the line that minimizes the sum of the squares of the residuals.
What is a residual? It’s the vertical distance between the actual data point and the line.
Residual = Observed value – Predicted value.
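A sketch of residuals in action, using illustrative data. One handy property of a least squares line (fitted with an intercept) is that its residuals always sum to zero — the positive and negative vertical distances exactly cancel:

```python
import numpy as np

# Illustrative data and its least squares line
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
b, a = np.polyfit(x, y, 1)            # slope b, intercept a

# Residual = observed value - predicted value
residuals = y - (a + b * x)

print(residuals)        # some positive, some negative
print(residuals.sum())  # effectively 0, up to floating-point error
```

If the residuals did not cancel, the line could be shifted up or down to reduce the sum of squares further, so it wouldn't be the least squares line.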

Which line to use?

In Case A: Only one line makes sense, \(y\) on \(x\), because \(x\) was controlled rather than random.
In Case B: We have two possible lines!
1. Use \(y\) on \(x\) to estimate \(y\) for a given \(x\).
2. Use \(x\) on \(y\) to estimate \(x\) for a given \(y\).
Both lines always pass through the "mean point" \((\bar{x}, \bar{y})\).
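The two Case B lines can be sketched from the same summary sums used for the PMCC (illustrative data again): \(y\) on \(x\) has gradient \(S_{xy}/S_{xx}\), while \(x\) on \(y\) has gradient \(S_{xy}/S_{yy}\), and both pass through \((\bar{x}, \bar{y})\):

```python
import numpy as np

# Illustrative data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

sxy = np.sum((x - x.mean()) * (y - y.mean()))
sxx = np.sum((x - x.mean()) ** 2)
syy = np.sum((y - y.mean()) ** 2)

# Line of y on x:  y = a + b*x  (use to estimate y for a given x)
b = sxy / sxx
a = y.mean() - b * x.mean()

# Line of x on y:  x = a2 + b2*y  (use to estimate x for a given y)
b2 = sxy / syy
a2 = x.mean() - b2 * y.mean()

# Both lines pass through the mean point (x-bar, y-bar)
assert np.isclose(a + b * x.mean(), y.mean())
assert np.isclose(a2 + b2 * y.mean(), x.mean())
```

Notice the two gradients are generally different, so the two lines only coincide when the correlation is perfect — another reason to pick the line that matches the direction of your prediction.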

Interpolation vs. Extrapolation

Interpolation: Predicting a value inside the range of your data. This is usually safe and reliable.
Extrapolation: Predicting a value outside the range of your data. This is dangerous because the trend might not continue!
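A small sketch of a prediction helper that flags this distinction (the data and the `predict` function are illustrative inventions, not a standard library routine):

```python
import numpy as np

# Illustrative model: regression line fitted to x values in [1, 5]
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
b, a = np.polyfit(x, y, 1)

def predict(x_new):
    """Predict y, flagging whether x_new lies inside the data range."""
    if x.min() <= x_new <= x.max():
        kind = "interpolation"
    else:
        kind = "extrapolation (risky!)"
    return a + b * x_new, kind

print(predict(2.5))   # inside the data range: usually safe
print(predict(50.0))  # far outside: the linear trend may not continue
```

The model will happily hand you a number for \(x = 50\); the flag is there to remind you the number may be meaningless.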

Quick Review Box:
Residuals: Small residuals = Good fit.
Interpolation: Staying inside the data "box" (Good).
Extrapolation: Going outside the data "box" (Risky).

Key Takeaway: Use the regression line wisely! Don't try to predict the height of a 50-year-old using a model based on toddlers (that's extrapolation!).

7. Summary of Bivariate Data

Visualize: Start with a scatter diagram to check the "Case" and look for outliers.
Measure: Use \(r\) for linear/Normal data and \(r_s\) for non-linear/ranked data.
Test: Use hypothesis tests to see if the relationship is statistically significant.
Predict: Use regression lines for interpolation, but be very careful of extrapolation.
Interpret: Always relate your mathematical findings back to the real-world context provided in the question.