Welcome to Estimation and Confidence Intervals!

In this chapter, we are going to learn how to be "mathematical detectives." In the real world, we almost never know everything about a population (like the exact average height of every person on Earth). Instead, we take a sample and use it to make a very good guess about the population. This is called estimation.

Don’t worry if some of the formulas look scary at first—we’ll break them down step-by-step. By the end of this, you’ll be able to calculate exactly how "confident" you are in your statistical guesses!

1. Estimators: Making the Best Guess

An estimator is like a rule or a formula that we use to guess a population parameter. An estimate is the actual number you get when you plug in your data.

Bias and Unbiased Estimators

Imagine you are practicing archery. If your arrows are always hitting a bit to the left of the bullseye, your aim is biased. In statistics, we want unbiased estimators. This means that if we took many, many samples, the average of all our estimates would equal the true population value.

There are two key unbiased estimators you need to know:

  1. The Sample Mean (\(\bar{x}\)): This is an unbiased estimator of the population mean (\(\mu\)). It’s just the average of your data!
  2. The Sample Variance (\(s^2\)): To make the sample variance unbiased, we divide by \(n - 1\) instead of just \(n\).

The Formula for Unbiased Sample Variance:
\(s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2\)

Quick Review: Why \(n - 1\)? Dividing by \(n - 1\) (known as Bessel's correction) slightly increases the result. This compensates for the fact that a small sample is likely to be less spread out than the whole population.
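To see Bessel's correction in action, here is a small sketch (with made-up data) that computes the unbiased sample variance by hand and checks it against Python's `statistics.variance`, which uses the same \(n - 1\) denominator:

```python
import statistics

data = [4.0, 7.0, 6.0, 5.0, 8.0]  # hypothetical sample
n = len(data)
mean = sum(data) / n

# Unbiased sample variance: divide by (n - 1), not n
s2 = sum((x - mean) ** 2 for x in data) / (n - 1)

# statistics.variance uses the same (n - 1) formula
assert abs(s2 - statistics.variance(data)) < 1e-9
print(round(s2, 2))  # 2.5
```

Note that dividing by \(n\) instead would give 2.0, a smaller (biased) result.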

Quality of Estimators

How do we choose between two different unbiased estimators? We look at their variance. A "good" estimator is not only unbiased but also has a small variance.
Analogy: Think of two archers. Both hit the bullseye on average, but Archer A's arrows are all tightly packed, while Archer B's arrows are scattered everywhere. Archer A is the better "estimator" because they are more consistent!

Key Takeaway: We want estimators that are unbiased and have minimum variance.
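The archer analogy can be checked with a quick simulation. For Normal data, the sample mean and the sample median are both centred on the true mean, but the mean has the smaller variance, so it is the "tighter archer." The population values below are assumptions made up for the simulation:

```python
import random
import statistics

random.seed(0)
mu = 50.0  # assumed true population mean for this simulation

# Draw many samples and record two estimators of mu from each
means, medians = [], []
for _ in range(2000):
    sample = [random.gauss(mu, 10.0) for _ in range(25)]
    means.append(statistics.mean(sample))
    medians.append(statistics.median(sample))

# For Normal data the sample mean should show the smaller spread
print(statistics.variance(means) < statistics.variance(medians))
```

This should print `True` for Normal data: both estimators hit the bullseye on average, but the mean's "arrows" are more tightly packed.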

2. Confidence Intervals: The "Safety Net"

A confidence interval is a range of values that we are fairly sure contains the true population mean. Instead of saying "the average is exactly 50," we say "we are 95% sure the average is between 48 and 52."

How to interpret it (The tricky part!)

Students often make a mistake here. A "95% Confidence Interval" does not mean there is a 95% probability that the population mean is in that specific interval.
It means: "If we took many samples and built a confidence interval for each one, 95% of those intervals would contain the true population mean."
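This interpretation can be demonstrated directly: build many 95% intervals from repeated samples and count how often they capture the true mean. The population parameters here are invented for the demonstration:

```python
import random

random.seed(1)
mu, sigma, n = 100.0, 15.0, 40  # assumed population values
z = 1.96                        # critical value for 95% confidence

covered = 0
trials = 1000
for _ in range(trials):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(sample) / n
    half_width = z * sigma / n ** 0.5
    # Does this interval contain the true mean?
    if xbar - half_width <= mu <= xbar + half_width:
        covered += 1

print(covered / trials)  # close to 0.95
```

The coverage rate lands near 0.95, exactly as the interpretation above predicts: the 95% describes the long-run behaviour of the procedure, not any single interval.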

Calculating a Confidence Interval for a Normal Mean

When the population variance (\(\sigma^2\)) is known, we use the Normal Distribution to find the limits:

\(\bar{x} \pm z \frac{\sigma}{\sqrt{n}}\)

  • \(\bar{x}\): Your sample mean.
  • \(z\): The critical value (depends on your confidence level, e.g., 1.96 for 95%).
  • \(\frac{\sigma}{\sqrt{n}}\): This is called the Standard Error. It tells us how much the sample mean varies.

Step-by-Step Process:
1. Find the sample mean (\(\bar{x}\)).
2. Identify the population standard deviation (\(\sigma\)) and sample size (\(n\)).
3. Choose your confidence level (e.g., 95%) and find the corresponding \(z\)-value from your tables.
4. Calculate the error margin: \(z \times \frac{\sigma}{\sqrt{n}}\).
5. Add and subtract this from \(\bar{x}\) to get your interval boundaries.
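The five steps above can be sketched in a few lines of Python. The sample mean, \(\sigma\), and \(n\) below are hypothetical numbers chosen for illustration:

```python
import math

# Hypothetical data: sample of 50 with mean 48.2, known sigma = 6
xbar, sigma, n = 48.2, 6.0, 50
z = 1.96  # critical value for 95% confidence

standard_error = sigma / math.sqrt(n)   # step: sigma / sqrt(n)
margin = z * standard_error             # step: error margin
lower, upper = xbar - margin, xbar + margin
print(round(lower, 2), round(upper, 2))  # 46.54 49.86
```

So with these numbers we would report: "we are 95% confident the population mean lies between about 46.54 and 49.86."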

Did you know? As your sample size (\(n\)) gets larger, your confidence interval gets narrower. This makes sense: more data means you are more certain about your answer!

Key Takeaway: A confidence interval provides a range of plausible values for a parameter, with the width determined by the standard error and the confidence level.

3. Hypothesis Tests: Difference Between Two Means

Sometimes we want to compare two different groups—for example, "Do students in Class A score higher than Class B?"

The Scenario

If we have two independent Normal populations with known variances (\(\sigma_x^2\) and \(\sigma_y^2\)), we can test if the difference between their means is a specific value (usually zero).

The Null Hypothesis (\(H_0\)): \(\mu_x - \mu_y = D\) (usually \(D=0\)).
The Test Statistic:
\(Z = \frac{(\bar{X} - \bar{Y}) - (\mu_x - \mu_y)}{\sqrt{\frac{\sigma_x^2}{n_x} + \frac{\sigma_y^2}{n_y}}}\)

This follows the Standard Normal Distribution \(Z \sim N(0, 1)\).

Common Mistake to Avoid: Don't forget to square the standard deviations if the question gives you \(\sigma\) instead of \(\sigma^2\)! Also, ensure you add the variances in the denominator, even if you are testing for a difference.
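Here is the test statistic computed for a made-up pair of samples, with the two pitfalls above handled explicitly (standard deviations squared, variances added):

```python
import math

# Hypothetical summary statistics for two independent Normal samples
xbar, ybar = 72.0, 68.5
sigma_x, sigma_y = 8.0, 10.0  # given as standard deviations: square them!
n_x, n_y = 40, 50

# Test H0: mu_x - mu_y = 0
# Variances are ADDED in the denominator, even for a difference of means
se = math.sqrt(sigma_x ** 2 / n_x + sigma_y ** 2 / n_y)
z = ((xbar - ybar) - 0) / se
print(round(z, 3))  # 1.845
```

Since 1.845 is below the 5% two-tailed critical value of 1.96, we would not reject \(H_0\) at the 5% level in this made-up example.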

4. Large Samples and the Central Limit Theorem (CLT)

What if the population isn't Normal, or we don't know the population variance (\(\sigma^2\))?

If your sample size (\(n\)) is large (usually \(n > 30\)), the Central Limit Theorem saves the day! It tells us that the distribution of the sample mean will be approximately Normal regardless of the original population's shape.

Unknown Variance in Large Samples

If we don't know \(\sigma^2\), we can use our unbiased sample variance (\(s^2\)) as a replacement. The formula for the test statistic or confidence interval stays basically the same, just swap \(\sigma^2\) for \(s^2\):

\(Z \approx \frac{(\bar{X} - \bar{Y}) - (\mu_x - \mu_y)}{\sqrt{\frac{s_x^2}{n_x} + \frac{s_y^2}{n_y}}}\)

Quick Review Box:
- Normal population + Variance Known: Use \(Z\) (Normal), whatever the sample size.
- Large sample + Variance Unknown: Use \(Z\) (Normal) by replacing \(\sigma^2\) with \(s^2\) (thanks to CLT).
- Standard Error: Always involves dividing by \(\sqrt{n}\).
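The large-sample version looks just like the known-variance test, with \(s^2\) swapped in for \(\sigma^2\). The summary statistics below are hypothetical:

```python
import math

# Hypothetical large-sample summaries (population variances unknown)
xbar, s2_x, n_x = 70.4, 64.0, 60
ybar, s2_y, n_y = 67.9, 81.0, 80

# Both samples are large, so the CLT lets us swap sigma^2 for s^2
se = math.sqrt(s2_x / n_x + s2_y / n_y)
z = (xbar - ybar) / se
print(round(z, 3))  # 1.734
```

Because both \(n_x\) and \(n_y\) exceed 30, comparing this value against Normal critical values is justified by the CLT.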

Key Takeaway: For large samples, we can treat the sample variance as the population variance and use Normal distribution methods to perform tests and find intervals.

Summary of Key Terms

Parameter: A value describing the whole population (e.g., \(\mu\)).
Statistic: A value calculated from a sample (e.g., \(\bar{x}\)).
Unbiased: On average, the estimator hits the true value.
Standard Error: The standard deviation of the estimator (usually \(\frac{\sigma}{\sqrt{n}}\)).
Central Limit Theorem: The "magic" rule that allows us to use Normal distributions for large samples.