Welcome to Unit 2: Exploring Two-Variable Data!

In Unit 1, we looked at one variable at a time (like just looking at the heights of students). In Unit 2, we are leveling up! We’re going to look at how two different variables might relate to each other. For example: Does the amount of time you spend studying actually predict your exam score? Does the weight of a car affect its gas mileage?

Don't worry if this seems a little intimidating at first. Statistics is really just a way of telling stories with numbers, and we're about to learn how to tell the story of a relationship between two things.

Section 1: Relationship Basics (Explanatory vs. Response)

Before we graph anything, we need to know which variable is which.

1. Explanatory Variable (x): This is the variable we think helps explain or predict the change in the other variable. Think of it as the "independent" variable. It always goes on the horizontal x-axis.
2. Response Variable (y): This is the outcome we are measuring. It "responds" to the explanatory variable. It always goes on the vertical y-axis.

Example: If you are studying how "Hours of Sleep" affects "Test Scores," hours of sleep is the explanatory variable (x) and test score is the response variable (y).

How to Describe a Scatterplot

When you look at a graph of dots (a scatterplot), you need to describe it using four specific characteristics. A great way to remember this is the acronym DUFS:

D - Direction: Is it positive (going up from left to right) or negative (going down from left to right)?
U - Unusual Features: Are there any outliers (dots far away from the pattern) or distinct clusters?
F - Form: Is it linear (looks like a straight line) or non-linear (curved)?
S - Strength: How closely do the points follow the form? Use words like weak, moderate, or strong.

Key Takeaway: Always describe a scatterplot using Direction, Unusual features, Form, and Strength in the context of the variables.

Section 2: Correlation (\(r\))

The correlation coefficient, denoted as \(r\), is a number that measures the strength and direction of a linear relationship between two quantitative variables.

Important Properties of \(r\):
• \(r\) is always between -1 and 1.
• If \(r\) is close to 1, it’s a strong positive linear relationship.
• If \(r\) is close to -1, it’s a strong negative linear relationship.
• If \(r\) is close to 0, there is little or no linear relationship. The points may look like a cloud of random dots, or they may follow a curved pattern that \(r\) cannot detect.
• \(r\) has no units. If you change inches to centimeters, \(r\) stays the exact same!
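
To see two of these properties in action, here is a minimal Python sketch that computes \(r\) by hand and then recomputes it after converting inches to centimeters. The data values and the function name `correlation` are made up for illustration; they are not from any real study.

```python
import math

def correlation(xs, ys):
    """Return the correlation coefficient r for paired lists xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of the deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: heights in inches and weights in pounds.
heights_in = [60, 62, 65, 68, 70, 72]
weights_lb = [110, 120, 135, 150, 160, 175]

r_inches = correlation(heights_in, weights_lb)

# Change the units of x from inches to centimeters and recompute r.
heights_cm = [h * 2.54 for h in heights_in]
r_cm = correlation(heights_cm, weights_lb)

print(round(r_inches, 4), round(r_cm, 4))  # the two values match
```

Both values come out the same (up to tiny rounding differences) because changing units rescales every x-value by the same factor, which cancels out in the formula.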

Common Mistake to Avoid: Correlation does NOT mean causation! Just because two variables have a strong correlation doesn't mean one causes the other. For example, ice cream sales and shark attacks tend to rise and fall together, but that's because a third variable (hot summer weather) drives both of them up at the same time!

Section 3: The Least-Squares Regression Line (LSRL)

The LSRL is just a fancy name for the "line of best fit." It’s the straight line that makes the sum of the squared vertical distances between the data points and the line as small as possible. In AP Stats, we write the equation like this:

\(\hat{y} = a + bx\)

What do these symbols mean?
• \(\hat{y}\) (pronounced "y-hat"): This is the predicted y-value for a given x.
• \(a\): The y-intercept. This is the predicted value of y when x is 0.
• \(b\): The slope. This is how much we predict y will change for every 1-unit increase in x.
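
Here is a small Python sketch of how \(a\) and \(b\) could be computed from data. It uses the standard least-squares formulas (slope = sum of products of deviations divided by sum of squared x-deviations, and intercept = mean of y minus slope times mean of x), which this guide does not derive. The study-hours data and the function name `least_squares_line` are made up for illustration.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx              # slope
    a = mean_y - b * mean_x    # y-intercept
    return a, b

# Hypothetical data: hours studied (x) and exam score (y).
hours = [1, 2, 3, 4, 5]
scores = [62, 70, 75, 72, 81]

a, b = least_squares_line(hours, scores)
print(f"y-hat = {a} + {b}x")   # y-hat = 60.0 + 4.0x

# Predicted score for a student who studies 3 hours.
predicted_score = a + b * 3
print(predicted_score)         # 72.0
```

For this made-up data set, the intercept says the predicted score for 0 hours of studying is 60 points, and the slope says the predicted score goes up by 4 points for each additional hour studied.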

Interpreting the Slope (The Template)

When the AP exam asks you to interpret the slope, use this sentence: "For every 1-unit increase in [explanatory variable x], there is a predicted increase/decrease of [slope value] in [response variable y]."

Pro-Tip: Never forget the word "predicted" or "on average." Statistics isn't about certainties; it's about patterns!

Section 4: Residuals

A residual is the difference between the actual data point and what the line predicted.

Formula: \(\text{Residual} = y - \hat{y}\)

Memory Aid: Just remember RAP (\(\text{Residual} = \text{Actual} - \text{Predicted}\)).

• A positive residual means the actual point is above the line (the model underestimated).
• A negative residual means the actual point is below the line (the model overestimated).
• If a residual is 0, the point is exactly on the line.
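
The three cases above can be sketched in a few lines of Python. The line \(\hat{y} = 60 + 4x\) and the hours/score data are hypothetical values chosen just for this example.

```python
# Hypothetical regression line: y-hat = 60 + 4x (assumed for illustration).
a, b = 60.0, 4.0

# Hypothetical data: hours studied (x) and actual exam score (y).
hours = [1, 2, 3, 4, 5]
scores = [62, 70, 75, 72, 81]

# Predicted value for each x, then Residual = Actual - Predicted.
predicted = [a + b * x for x in hours]
residuals = [y - y_hat for y, y_hat in zip(scores, predicted)]

for x, y, y_hat, res in zip(hours, scores, predicted, residuals):
    if res > 0:
        position = "above the line (model underestimated)"
    elif res < 0:
        position = "below the line (model overestimated)"
    else:
        position = "exactly on the line"
    print(f"x={x}: actual={y}, predicted={y_hat}, residual={res} -> {position}")
```

For example, the student who studied 4 hours scored 72, but the line predicted 76, so the residual is 72 - 76 = -4 and the point sits below the line.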

Residual Plots

A residual plot is a separate graph that shows the residuals on the y-axis and the explanatory variable (or the predicted values) on the x-axis. It helps us decide if a linear model is appropriate.
• If the residual plot looks like a random scatter (no pattern), then a linear model is appropriate.
• If the residual plot has a clear curve or U-shape, then a linear model is NOT appropriate; you probably need a curve!
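
To see what a "bad" residual pattern looks like in numbers, here is a rough Python sketch that fits a straight line to data that is actually curved (\(y = x^2\)) and prints the residuals. The data and the helper name `fit_line` are made up for illustration, and the fit uses the standard least-squares formulas.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return mean_y - b * mean_x, b

# Hypothetical curved data: y is exactly x squared.
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [x ** 2 for x in xs]

a, b = fit_line(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

# Print each residual next to its x-value, as a residual plot would show it.
for x, res in zip(xs, residuals):
    print(f"x={x}: residual={res}")
```

The residuals start positive, dip negative in the middle, and come back up to positive at the end. That U-shape is the signal that a straight line is not the right model for this data.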

Key Takeaway: We want our residual plots to look like boring, random clouds of points.

Section 5: Assessing the Fit (\(s\) and \(r^2\))

How do we know if our line is actually doing a good job? We use two main numbers:

1. Standard Deviation of the Residuals (\(s\)):
This number tells us the typical distance between the actual y-values and the y-values predicted by the line. It’s measured in the same units as the y-variable.

2. Coefficient of Determination (\(r^2\)):
This is literally the correlation coefficient \(r\) squared. It’s usually written as a percentage.
Interpretation Template: "About [\(r^2\) %] of the variation in [response variable y] is accounted for by the linear model relating it to [explanatory variable x]."
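
Here is a short Python sketch of how both numbers could be computed for a small made-up data set. It assumes the usual formula \(s = \sqrt{\sum (y - \hat{y})^2 / (n - 2)}\), which this guide does not state, and it checks that \(r^2\) computed from the variation matches \(r\) squared. All data values and variable names are hypothetical.

```python
import math

# Hypothetical data: hours studied (x) and exam score (y).
hours = [1, 2, 3, 4, 5]
scores = [62, 70, 75, 72, 81]

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(scores) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, scores))
sxx = sum((x - mean_x) ** 2 for x in hours)
syy = sum((y - mean_y) ** 2 for y in scores)   # total variation in y

# Least-squares line and its residuals.
b = sxy / sxx
a = mean_y - b * mean_x
residuals = [y - (a + b * x) for x, y in zip(hours, scores)]

# s: standard deviation of the residuals (assumed divisor n - 2).
sse = sum(res ** 2 for res in residuals)
s = math.sqrt(sse / (n - 2))

# r^2 two ways: share of variation explained, and r squared.
r_squared = 1 - sse / syy
r = sxy / math.sqrt(sxx * syy)

print(round(s, 2))          # 3.37
print(round(r_squared, 3))  # 0.825
print(round(r ** 2, 3))     # 0.825
```

In context: the actual exam scores are typically about 3.37 points away from the scores predicted by the line, and about 82.5% of the variation in exam score is accounted for by the linear model relating it to hours studied.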

Quick Review:
• \(r\): Strength and Direction.
• \(r^2\): Percent of variation explained.
• \(s\): Typical prediction error.

Section 6: Unusual Points

Not all points are created equal! There are three types of "special" points:

Outliers: Points that have a large residual (they are far away from the line vertically).
High Leverage Points: Points that have an x-value far away from the other x-values (far to the left or right). They have the potential to "pull" on the line like a lever.
Influential Points: A point is influential if removing it substantially changes the slope, y-intercept, or \(r\) value.

Did you know? A point can have high leverage but not be influential if it falls perfectly in line with the rest of the data. It only becomes influential if it "tilts" the line!
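
Here is a rough Python sketch of that idea. It adds one high-leverage point (x = 20) to a small made-up data set twice: once roughly in line with the pattern and once far off it, and compares the slopes. The data and the function name `slope_of` are hypothetical.

```python
def slope_of(points):
    """Return the least-squares slope for a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx

# Hypothetical base data that follows a roughly linear pattern.
base = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 10.0)]

slope_base = slope_of(base)

# High-leverage point that falls in line with the rest of the data.
slope_in_line = slope_of(base + [(20, 39.5)])

# High-leverage point that falls far from the pattern.
slope_off_line = slope_of(base + [(20, 10.0)])

print(round(slope_base, 2), round(slope_in_line, 2), round(slope_off_line, 2))
```

The in-line point barely moves the slope, so it has high leverage but is not influential. The off-line point drags the slope way down, so it is both high leverage and influential.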

Wrapping Up Unit 2

Summary Checklist:
1. Can you identify \(x\) and \(y\)?
2. Can you describe a scatterplot using DUFS?
3. Do you know that \(r\) only measures linear relationships?
4. Can you calculate a residual using Actual - Predicted?
5. Can you interpret the slope and \(r^2\) in context?
6. Can you use a residual plot to decide whether a linear model is appropriate?

Great job! You've just covered the core of Two-Variable Data. Keep practicing those interpretations. They are the key to scoring high on the AP exam!