Introduction to Bivariate Data
Welcome to the world of Bivariate Data! While "univariate" data looks at one thing at a time (like the heights of students), bivariate data is all about relationships. We look at two different variables for the same individual to see if they are linked. For example, does the amount of time you spend gaming affect your reaction speed?
In this chapter, we’ll learn how to visualize these relationships, measure how strong they are, and even make predictions. Don't worry if the formulas look a bit intimidating at first—most of the heavy lifting is done by your calculator!
1. The Two Types of Bivariate Data
Before we start calculating, we need to understand how the data was collected. The MEI syllabus splits this into two cases:
Case A: Random on Non-Random
This happens when an experimenter controls one variable (the independent variable, \(x\)) and measures the other (the dependent variable, \(y\)).
Example: A scientist decides to test a spring at exactly 10g, 20g, and 30g. The weights are fixed (non-random), but the extension of the spring will vary slightly (random).
Case B: Random on Random
This is when we just observe two things that both happen naturally. We don't control either one.
Example: Measuring the height and weight of 50 random people. Both height and weight are random variables. This usually looks like a "data cloud" on a graph.
Quick Review:
• Case A: One variable is controlled (e.g., "I chose these specific times").
• Case B: Both variables are measured as they are (e.g., "I just recorded what I found").
2. Scatter Diagrams
A scatter diagram is our first port of call. It helps us see the relationship (or correlation) between two variables.
- Independent variable (\(x\)): Usually goes on the horizontal axis. In Case A, this is the variable you controlled.
- Dependent variable (\(y\)): Goes on the vertical axis.
- Outliers: These are data points that don't fit the general pattern. We identify these "by eye" initially.
Did you know? Software-produced scatter diagrams often include a "trendline" and an \(r^2\) value. The closer \(r^2\) is to 1, the better the line fits the data!
3. Pearson’s Product Moment Correlation Coefficient (PMCC)
The PMCC (represented by the letter \(r\)) measures the strength of a linear relationship. Its value is always between -1 and +1.
- \(r = +1\): Perfect positive linear correlation (a perfect straight line going up).
- \(r = 0\): No linear correlation.
- \(r = -1\): Perfect negative linear correlation (a perfect straight line going down).
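The PMCC can be computed by hand from the summary statistics \(S_{xy}\), \(S_{xx}\) and \(S_{yy}\). As a minimal sketch (your calculator does this for you in the exam), using the standard formula \(r = S_{xy}/\sqrt{S_{xx}S_{yy}}\):

```python
import math

def pmcc(xs, ys):
    """Pearson's product moment correlation coefficient, r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Summary statistics: Sxy, Sxx, Syy
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# A perfect positive linear relationship gives r = +1:
print(pmcc([1, 2, 3, 4], [2, 4, 6, 8]))   # 1.0
# A perfect negative linear relationship gives r = -1:
print(pmcc([1, 2, 3], [3, 2, 1]))          # -1.0
```

Notice that doubling every \(y\) value (or adding a constant) leaves \(r\) unchanged: the PMCC measures the strength of the linear pattern, not its slope.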
When is it appropriate to use \(r\)?
For a hypothesis test using PMCC to be valid, the data must follow a Bivariate Normal Distribution. You can't usually prove this, but you can look for an elliptical (football-shaped) cloud of points on your scatter diagram. If the data is skewed, bimodal, or non-linear, PMCC is not the right tool!
Hypothesis Testing for PMCC
We test if there is evidence of correlation in the whole population (represented by the Greek letter \(\rho\), pronounced 'rho').
- Null Hypothesis (\(H_0\)): \(\rho = 0\) (There is no correlation in the population).
- Alternative Hypothesis (\(H_1\)): \(\rho > 0\) or \(\rho < 0\) (one-tailed), or \(\rho \neq 0\) (two-tailed).
- Test Statistic: Your calculated \(r\) value.
- Decision: Compare your \(p\)-value to the significance level, or your \(r\) value to a critical value from a table.
Common Mistake: Never say "this proves" there is correlation. Use non-assertive language like: "There is sufficient evidence to suggest that there is positive correlation between..."
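The decision step above can be sketched as a small helper. This is an illustration only: the critical value must come from your tables for the correct \(n\) and significance level (the 0.62 used below is a made-up example value, not a tabulated one), and the wording deliberately follows the non-assertive style described above.

```python
def pmcc_decision(r, critical_value, tail="positive"):
    """Compare the test statistic r against a critical value from tables."""
    if tail == "positive":        # H1: rho > 0
        reject = r > critical_value
    elif tail == "negative":      # H1: rho < 0
        reject = r < -critical_value
    else:                         # H1: rho != 0 (two-tailed)
        reject = abs(r) > critical_value
    if reject:
        return "Reject H0: sufficient evidence to suggest correlation."
    return "Do not reject H0: insufficient evidence to suggest correlation."

# With a hypothetical critical value of 0.62:
print(pmcc_decision(0.75, 0.62))   # sufficient evidence
print(pmcc_decision(0.40, 0.62))   # insufficient evidence
```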
4. Spearman’s Rank Correlation Coefficient (\(r_s\))
Sometimes, data isn't linear, or it's just "messy." Spearman’s Rank is used to find association rather than just linear correlation. It measures how monotonic a relationship is (does one variable generally increase as the other increases, even if it's not a straight line?).
Step-by-step process:
1. Rank your \(x\) values (1 for smallest, etc.).
2. Rank your \(y\) values.
3. Use your calculator to find the PMCC of these ranks. This value is your \(r_s\).
Encouraging Tip: Don't worry about "tied ranks" (where two values are the same). The MEI syllabus for the Minor section excludes them from manual calculation!
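The three steps above can be sketched in code. This hand-rolled version assumes no tied ranks (as the syllabus allows), and reuses the PMCC formula \(r = S_{xy}/\sqrt{S_{xx}S_{yy}}\) on the ranks:

```python
import math

def ranks(values):
    """Rank values with 1 for the smallest (assumes no tied values)."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def pmcc(xs, ys):
    """Pearson's r, used here on ranks to give Spearman's r_s."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def spearman(xs, ys):
    return pmcc(ranks(xs), ranks(ys))

# A monotonic but non-linear relationship (y = x^3):
# r_s = 1 even though the points do not lie on a straight line.
print(spearman([1, 2, 3, 4, 5], [1, 8, 27, 64, 125]))   # 1.0
```

The cubic example is the key idea: Spearman's rewards "always increasing", not "straight line".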
PMCC vs. Spearman's: Which one to use?
- Use PMCC (\(r\)) if the data is linear and looks like a Bivariate Normal "cloud."
- Use Spearman's (\(r_s\)) if the data is non-linear (but monotonic) or if you have any doubts about the Normal distribution assumption.
5. Linear Regression
Regression is about finding the "Line of Best Fit." We use the Least Squares method, which minimizes the sum of the squares of the vertical distances from the points to the line.
The Two Regression Lines
In Case B (Random on Random), there are actually two lines!
- \(y\) on \(x\): Use this to estimate \(y\) when you know \(x\). It minimizes vertical distances.
- \(x\) on \(y\): Use this to estimate \(x\) when you know \(y\). It minimizes horizontal distances.
Key Fact: Both lines always pass through the mean point \((\bar{x}, \bar{y})\).
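The \(y\) on \(x\) line can be sketched using the standard least-squares formulae \(b = S_{xy}/S_{xx}\) and \(a = \bar{y} - b\bar{x}\); swapping the roles of the variables gives the \(x\) on \(y\) line. The data values below are illustrative only:

```python
def regression_y_on_x(xs, ys):
    """Least-squares line y = a + b*x (minimises vertical distances)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx          # gradient
    a = my - b * mx        # intercept, forcing the line through (x̄, ȳ)
    return a, b

def regression_x_on_y(xs, ys):
    """Least-squares line x = a' + b'*y (minimises horizontal distances)."""
    return regression_y_on_x(ys, xs)   # same formula, roles swapped

xs = [1, 2, 3, 4, 5]
ys = [2, 3, 5, 4, 6]
a, b = regression_y_on_x(xs, ys)
mx, my = sum(xs) / 5, sum(ys) / 5
# Key fact check: the line passes through the mean point (x̄, ȳ)
print(abs((a + b * mx) - my) < 1e-9)   # True
```

Because \(a = \bar{y} - b\bar{x}\) by construction, the mean-point property in the Key Fact above is guaranteed, not a coincidence.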
Residuals
A residual is the difference between the actual observed value and the value predicted by your regression line.
\(\text{Residual} = \text{observed } y - \text{predicted } y\)
If the residuals are small and randomly scattered, your linear model is a good fit!
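Residuals are quick to compute once you have the line. The sketch below uses the line \(y = 1.3 + 0.9x\), which is the least-squares fit for the illustrative data shown; note that for any least-squares line the residuals sum to zero:

```python
def residuals(xs, ys, a, b):
    """Observed y minus predicted y, for the line y = a + b*x."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]

xs = [1, 2, 3, 4, 5]
ys = [2, 3, 5, 4, 6]
res = residuals(xs, ys, a=1.3, b=0.9)   # least-squares line for this data
print([round(r, 1) for r in res])       # [-0.2, -0.1, 1.0, -0.9, 0.2]
# For a least-squares fit, the residuals always sum to zero:
print(abs(sum(res)) < 1e-9)             # True
```

Plotting these residuals against \(x\) is the standard way to check for the "small and randomly scattered" pattern: a curve or fan shape in the residuals warns you the linear model is inappropriate.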
6. Making Predictions
We use our regression equation \(y = a + bx\) to predict values. However, you must be careful:
- Interpolation: Predicting a value within the range of your data. This is usually reliable.
- Extrapolation: Predicting a value outside the range of your data. This is dangerous because the linear trend might not continue!
Analogy: Interpolation is like guessing the middle of a movie you've seen the start and end of. Extrapolation is like trying to guess what happens in the sequel based only on the first movie—you might be completely wrong!
Key Takeaway Summary:
• PMCC (\(r\)) measures linear strength; needs a "Normal cloud."
• Spearman's (\(r_s\)) measures association using ranks; no Normal assumption needed.
• Hypothesis tests start with \(H_0: \rho = 0\) (no correlation in the population).
• Regression lines are for prediction: interpolate with confidence, extrapolate with caution!