Welcome to Unit S3: Estimation, Confidence Intervals, and Tests!
In your previous Statistics units, you mostly worked with data that was already given to you. In Statistics 3, we step into the shoes of a real-world researcher. How do we know the average height of everyone in a country without measuring every single person? We estimate. In this chapter, you’ll learn how to make "smart guesses" about a whole population using just a small sample, and how to be confident about those guesses. Don't worry if it seems like a lot of formulas at first—we'll break them down step-by-step!
1. Estimators and Bias
When we want to know something about a population (the whole group), we use a statistic calculated from a sample (a small part of the group). This statistic is called an estimator.
Key Terms:
- Estimator: A formula or method used to calculate an estimate (e.g., the sample mean formula).
- Estimate: The actual number you get when you plug your data into the formula.
- Bias: An estimator is unbiased if its expected value equals the true population value, i.e. the estimates it produces average out to the true value over many repeated samples. Imagine throwing many darts at a target; if your aim is unbiased, the "average" position of all your hits would be right on the bullseye!
The Main Unbiased Estimators:
1. Unbiased Estimate of the Population Mean (\(\mu\)): This is just the sample mean, denoted as \(\bar{x}\).
\( \bar{x} = \frac{\sum x}{n} \)
2. Unbiased Estimate of the Population Variance (\(\sigma^2\)): We use the symbol \(s^2\).
Important Note: We divide by \(n-1\) instead of \(n\) to make the estimate unbiased. This is called Bessel's Correction.
\( s^2 = \frac{1}{n-1} \sum (x_i - \bar{x})^2 \) or the more calculator-friendly version:
\( s^2 = \frac{1}{n-1} \left( \sum x^2 - \frac{(\sum x)^2}{n} \right) \)
Quick Review Box:
If a question gives you a sample and asks for an unbiased estimate of the variance, divide by \(n-1\), not \(n\). If the data is already summarized as \(S_{xx} = \sum (x - \bar{x})^2\), then \(s^2 = \frac{S_{xx}}{n-1}\).
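As a quick check, the two variance formulas above can be compared in Python. This is a minimal sketch; the five data values are made up for illustration and are not from any exam question.

```python
import statistics

# Hypothetical sample (illustrative values only)
sample = [4, 7, 5, 9, 10]
n = len(sample)

# Unbiased estimate of the population mean: the sample mean
x_bar = sum(sample) / n

# Unbiased variance estimate, definition form (divide by n - 1)
s2_definition = sum((x - x_bar) ** 2 for x in sample) / (n - 1)

# Calculator-friendly form using the totals sum(x) and sum(x^2)
sum_x = sum(sample)
sum_x2 = sum(x * x for x in sample)
s2_shortcut = (sum_x2 - sum_x ** 2 / n) / (n - 1)

# statistics.variance also divides by n - 1
s2_library = statistics.variance(sample)

print(x_bar, s2_definition, s2_shortcut, s2_library)
```

All three variance values agree, which is a useful way to confirm that the shortcut formula has been entered correctly.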
2. The Distribution of the Sample Mean
If you take many different samples of the same size \(n\) from a population with mean \(\mu\) and variance \(\sigma^2\), the sample means (\(\bar{X}\)) will form their own distribution.
Key Properties:
1. The mean of the sample means is the same as the population mean: \(E(\bar{X}) = \mu\).
2. The variance of the sample means is smaller than the population variance (for any \(n > 1\)): \(Var(\bar{X}) = \frac{\sigma^2}{n}\).
3. The Standard Error is the standard deviation of this distribution: \(\text{Standard Error} = \frac{\sigma}{\sqrt{n}}\).
Analogy: Individual heights in a city vary a lot (high variance). But if you take groups of 50 people and find the average height of each group, those averages will be very similar to each other (low variance/standard error).
Takeaway: If the original population is Normal, then \(\bar{X} \sim N(\mu, \frac{\sigma^2}{n})\).
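The properties above can be checked with a small simulation. This is only a sketch: the population values (mean 170, standard deviation 10), the sample size, and the number of repeats are all assumptions chosen for illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

mu, sigma = 170.0, 10.0   # hypothetical population mean and s.d.
n = 25                    # size of each sample
num_samples = 5000        # how many samples we draw

# Draw many samples from a Normal population and record each sample mean
sample_means = []
for _ in range(num_samples):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    sample_means.append(statistics.fmean(sample))

mean_of_means = statistics.fmean(sample_means)
var_of_means = statistics.variance(sample_means)
standard_error = sigma / n ** 0.5

print(f"mean of sample means = {mean_of_means:.2f} (theory: {mu})")
print(f"variance of sample means = {var_of_means:.2f} (theory: {sigma**2 / n})")
print(f"standard error = {standard_error}")
```

The simulated values should land close to \(\mu\) and \(\frac{\sigma^2}{n}\), though they will not match exactly because the simulation is itself random.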
3. Confidence Intervals for a Normal Mean
A Confidence Interval (CI) is a range of values, calculated from a sample, that is designed to contain the true population mean \(\mu\) with a stated level of confidence (such as 95%).
The Formula:
\( \bar{x} \pm z \left( \frac{\sigma}{\sqrt{n}} \right) \)
Where \(z\) is a value from the Normal tables based on your confidence level:
- For a 90% CI, use \(z = 1.645\)
- For a 95% CI, use \(z = 1.960\)
- For a 99% CI, use \(z = 2.576\)
Step-by-Step Process:
1. Find the sample mean \(\bar{x}\).
2. Identify the population standard deviation \(\sigma\). (If unknown and \(n\) is large, use \(s\)).
3. Choose the correct \(z\)-value for the percentage required.
4. Calculate the "error margin": \(z \times \frac{\sigma}{\sqrt{n}}\).
5. Write the interval as \((\bar{x} - \text{error}, \bar{x} + \text{error})\).
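The steps above can be sketched as a short Python function. The function name `z_confidence_interval` and the numbers in the example are hypothetical; the \(z\)-value is looked up with the standard library's `NormalDist` rather than from printed tables.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(x_bar, sigma, n, level=0.95):
    """Return (lower, upper) for a CI for the mean with known sigma."""
    # z-value that leaves (1 - level) / 2 in each tail of N(0, 1)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    # The "error margin": z times the standard error
    margin = z * sigma / sqrt(n)
    return x_bar - margin, x_bar + margin


# Hypothetical example: sample mean 50, sigma 4, n = 64
lower, upper = z_confidence_interval(50, 4, 64, level=0.95)
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```

Changing `level` to 0.90 or 0.99 reproduces the other two \(z\)-values listed above (1.645 and 2.576, to three decimal places).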
Did you know?
A 95% confidence interval doesn't mean there is a 95% chance the mean is in that specific interval. It means that if we repeated the whole sampling process many times, about 95% of the intervals we calculated would contain the true mean.
4. The Central Limit Theorem (CLT)
This is arguably the most powerful tool in Statistics!
The Rule: If your sample size \(n\) is large (a common rule of thumb is \(n > 30\)), the distribution of the sample mean \(\bar{X}\) will be approximately Normal, whatever the shape of the original population (provided it has a finite variance).
Why is this useful?
If you are testing the mean of a population that is skewed (like house prices or income), you can still use Normal distribution methods as long as your sample is big enough!
Common Mistake: Students often think the CLT says the population becomes Normal. It doesn't! It only says the sample mean distribution becomes Normal.
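To see the CLT at work, here is a rough simulation using a strongly skewed population. The choice of an exponential population (mean 1, variance 1), the sample size of 40, and the number of repeats are all assumptions made for this sketch.

```python
import random
import statistics

random.seed(2)  # fixed seed so the run is repeatable

n = 40               # size of each sample (above the n > 30 rule of thumb)
num_samples = 4000   # number of samples drawn

# Exponential population with rate 1: right-skewed, mean 1, variance 1
pop_mean, pop_var = 1.0, 1.0
se = (pop_var / n) ** 0.5

sample_means = []
for _ in range(num_samples):
    sample = [random.expovariate(1.0) for _ in range(n)]
    sample_means.append(statistics.fmean(sample))

# If the sample means are roughly Normal, about 95% of them
# should lie within 1.96 standard errors of the population mean
within = sum(abs(m - pop_mean) <= 1.96 * se for m in sample_means) / num_samples

print(f"mean of sample means = {statistics.fmean(sample_means):.3f}")
print(f"proportion within 1.96 SE = {within:.3f}")
```

The population itself stays skewed throughout; only the collection of sample means behaves approximately like a Normal distribution.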
5. Hypothesis Testing for a Mean
We use these tests to check if a claim about a population mean is likely to be true.
Case A: Single Mean (Variance Known)
We test the null hypothesis \(H_0: \mu = \mu_0\).
The Test Statistic is:
\( z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} \)
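The Case A test statistic can be sketched in Python as follows, where `mu_0` is the value of the mean claimed by \(H_0\). The function name and the example numbers are hypothetical.

```python
from math import sqrt


def one_sample_z(x_bar, mu_0, sigma, n):
    """z statistic for testing H0: mu = mu_0 when sigma is known."""
    standard_error = sigma / sqrt(n)
    return (x_bar - mu_0) / standard_error


# Hypothetical example: H0 says mu = 50; a sample of 36 has mean 52, sigma = 6
z = one_sample_z(x_bar=52, mu_0=50, sigma=6, n=36)
print(z)
```

Here the standard error is \(6 / \sqrt{36} = 1\), so the sample mean sits 2 standard errors above the value claimed by \(H_0\).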
Case B: Difference Between Two Means
If we want to compare two separate groups (e.g., "Do boys score higher than girls?"), we look at the difference \(\bar{X} - \bar{Y}\).
Assuming the groups are independent, the test statistic is:
\( z = \frac{(\bar{x} - \bar{y}) - (\mu_x - \mu_y)}{\sqrt{\frac{\sigma_x^2}{n_x} + \frac{\sigma_y^2}{n_y}}} \)
Usually, under \(H_0\), we assume the means are equal, so \((\mu_x - \mu_y) = 0\).
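A minimal sketch of the Case B statistic is shown below, assuming independent groups with known variances. The function name, the default `diff_0=0` (equal means under \(H_0\)), and the example numbers are illustrative assumptions.

```python
from math import sqrt


def two_sample_z(x_bar, y_bar, var_x, var_y, n_x, n_y, diff_0=0.0):
    """z statistic for H0: mu_x - mu_y = diff_0 (independent samples)."""
    # Variances of independent sample means add
    standard_error = sqrt(var_x / n_x + var_y / n_y)
    return ((x_bar - y_bar) - diff_0) / standard_error


# Hypothetical example: group X (n = 40, mean 75, variance 16)
# and group Y (n = 50, mean 72, variance 25)
z = two_sample_z(75, 72, 16, 25, 40, 50)
print(round(z, 3))
```

Note that the variances are divided by their own sample sizes before being added; adding the standard deviations instead is a common slip.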
Case C: Large Samples (Variance Unknown)
If you don't know the population variance \(\sigma^2\), but your sample is large (\(n > 30\)), you can simply swap \(\sigma^2\) for your unbiased sample estimate \(s^2\). The CLT allows us to still use the \(z\)-test!
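For Case C, the only change is that \(s\) is calculated from the data and used in place of \(\sigma\). The sketch below uses a made-up sample of 35 values and a hypothetical claim \(H_0: \mu = 50\).

```python
import statistics
from math import sqrt

# Hypothetical large sample: the values 48, 49, ..., 54 repeated 5 times (n = 35)
sample = [48 + (i % 7) for i in range(35)]
n = len(sample)
mu_0 = 50  # value of the mean claimed by H0 (assumed for illustration)

x_bar = statistics.mean(sample)
s2 = statistics.variance(sample)  # unbiased estimate, divides by n - 1

# With n > 30, use s in place of the unknown sigma
z = (x_bar - mu_0) / sqrt(s2 / n)

print(f"x_bar = {x_bar}, s^2 = {s2:.3f}, z = {z:.3f}")
```

The resulting \(z\) is then compared with a critical value from the Normal tables in the usual way.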
Key Takeaway for Tests:
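- Two-tailed test: if \(|z_{\text{calculated}}| > z_{\text{critical}}\), we reject \(H_0\). There is significant evidence that the mean has changed.
- One-tailed test: reject \(H_0\) only if \(z_{\text{calculated}}\) lies beyond the critical value in the direction stated in \(H_1\) (for example, \(z_{\text{calculated}} > 1.645\) when testing for an increase at the 5% level).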
- Always conclude with a sentence in context: "There is significant evidence at the 5% level to suggest that the new fertilizer increased the mean height of the plants."
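The decision rule can be sketched as a small helper. The function name `reject_h0` and the labels `"two"`, `"upper"`, and `"lower"` are hypothetical choices for this example; the critical values come from the standard Normal distribution.

```python
from statistics import NormalDist


def reject_h0(z, alpha=0.05, tail="two"):
    """Return True if z falls in the critical region at significance level alpha."""
    std_normal = NormalDist()
    if tail == "two":
        # Any change: split alpha between both tails
        return abs(z) > std_normal.inv_cdf(1 - alpha / 2)
    if tail == "upper":
        # Looking for an increase: all of alpha in the upper tail
        return z > std_normal.inv_cdf(1 - alpha)
    if tail == "lower":
        # Looking for a decrease: all of alpha in the lower tail
        return z < -std_normal.inv_cdf(1 - alpha)
    raise ValueError("tail must be 'two', 'upper' or 'lower'")


# The same z can lead to different conclusions depending on the tails
print(reject_h0(1.8, tail="two"))    # critical value is about 1.960
print(reject_h0(1.8, tail="upper"))  # critical value is about 1.645
```

This is why the hypotheses, including the direction of \(H_1\), should be written down before the test statistic is calculated.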
Summary Checklist
- Did I use \(n-1\) for the unbiased variance estimate?
- Is the sample size large enough (\(n > 30\)) to use the Central Limit Theorem?
- Did I divide the standard deviation by \(\sqrt{n}\) to get the Standard Error?
- Is my hypothesis test one-tailed (looking for an increase/decrease) or two-tailed (looking for any change)?
- Have I written my final answer in the context of the question?
Don't worry if this seems tricky at first! Estimation is all about practice. Once you recognize which formula fits the "known" or "unknown" variance scenarios, the rest is just careful calculator work. You've got this!