Welcome to Inference!
In Statistics, we often want to know something about a huge group of people or things (the population), but we can't possibly check every single one. Imagine trying to find the average height of every person in the UK—it’s impossible! Instead, we take a smaller sample and use it to make an educated guess about the whole population. This "educated guessing" is what we call Statistical Inference.
In this chapter, you’ll learn how to estimate population values, how to build "safety nets" called confidence intervals, and how to test if an average value is actually what people claim it is. Don't worry if it sounds a bit technical at first; we will break it down step-by-step!
1. Estimating Parameters: Point Estimates
A point estimate is just a single number from our sample that we use as our best guess for the population.
- Estimating the Mean: We use the sample mean, denoted as \(\bar{x}\), to estimate the population mean \(\mu\). So, \(\hat{\mu} = \bar{x}\).
- Estimating the Variance: To get an unbiased estimate of the population variance (\(\sigma^2\)), we use the sample variance \(s^2\). Important Tip: We divide by \(n - 1\) instead of \(n\). Dividing by \(n\) would systematically underestimate the spread, because the deviations are measured from the sample mean \(\bar{x}\) rather than the true mean \(\mu\), which makes the sum of squares too small on average; dividing by \(n - 1\) corrects for this.
The formula for the unbiased estimate of variance is:
\(s^2 = \frac{1}{n-1} \sum (x_i - \bar{x})^2\)
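You can check both point estimates numerically. Here is a minimal Python sketch (the five data values are made up for illustration):

```python
import statistics

# Hypothetical sample of five measurements
data = [4.0, 6.0, 5.0, 7.0, 3.0]
n = len(data)

# Point estimate of the mean: the sample mean x-bar
xbar = sum(data) / n

# Unbiased estimate of the variance: divide by n - 1, not n
s2 = sum((x - xbar) ** 2 for x in data) / (n - 1)

# Python's statistics.variance uses the same n - 1 divisor
print(xbar, s2)  # 5.0 2.5
```

Note that `statistics.variance` already divides by \(n - 1\); the "population" version that divides by \(n\) is the separate function `statistics.pvariance`.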
What is the Standard Error?
The Standard Error of the Mean (SE) tells us how much we expect our sample mean to vary from the true population mean. Think of it as the "standard deviation of our guesses."
Formula: \(SE = \frac{\sigma}{\sqrt{n}}\) (If we don't know \(\sigma\), we use \(s\) from our sample instead).
Quick Review: Larger samples (\(n\)) make the Standard Error smaller. This makes sense: a bigger sample gives you a more reliable guess!
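As a quick numerical illustration (the value of \(s\) here is invented):

```python
import math

s = 12.0  # hypothetical sample standard deviation

def standard_error(s, n):
    # SE = s / sqrt(n): divide by the square root of n, not by n
    return s / math.sqrt(n)

# Quadrupling the sample size halves the standard error
print(standard_error(s, 25), standard_error(s, 100), standard_error(s, 400))
# 2.4 1.2 0.6
```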
2. The Sampling Distribution and the CLT
If you take many samples and plot their means, those means form their own distribution. This is the sampling distribution of the mean.
- If the original population is Normal, the sample means will also be Normal.
- The Central Limit Theorem (CLT): This is like a magic trick in stats. It says that if your sample size is "large enough" (usually \(n > 30\)), the distribution of the sample mean will look like a Normal Distribution, even if the original population was weirdly shaped!
Analogy: Imagine a soup. Even if the ingredients (the population) are chunky and uneven, if you take a large enough ladle (the sample), the average taste of that ladle will be very consistent with the next ladle.
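The CLT is easy to see by simulation. This sketch draws many samples from a deliberately skewed (Exponential) population and looks at where their means land; the seed, sample size, and number of repetitions are arbitrary choices:

```python
import random
import statistics

random.seed(42)

# A right-skewed, clearly non-Normal population: Exponential with mean 1
def sample_mean(n):
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))

# Take 2000 samples of size 50; a histogram of `means` looks Normal
means = [sample_mean(50) for _ in range(2000)]

# The means centre on the population mean (1.0), with spread close to
# the theoretical SE = sigma / sqrt(n) = 1 / sqrt(50), about 0.141
print(round(statistics.fmean(means), 2))
print(round(statistics.stdev(means), 2))
```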
3. Confidence Intervals (CIs)
Instead of just giving one number (a point estimate), a confidence interval gives a range of values where we are fairly sure the true population mean lives.
The "Z" vs "T" Choice
This is where many students get stuck, but the rule is simple:
1. Use the Normal Distribution (z) if:
- The sample size is large (and if \(\sigma^2\) is unknown, \(s^2\) serves as a good estimate for it), or
- The population variance \(\sigma^2\) is known.
2. Use the t-distribution if:
- The sample size is small AND the population variance \(\sigma^2\) is unknown (but the population must be Normal).
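The decision rule above can be written out directly. A tiny sketch (the \(n > 30\) threshold is the usual convention from this chapter, not a hard law):

```python
def use_t(n, sigma_known, population_normal, large_n=30):
    """Return True if the t-distribution should be used for the interval."""
    # Large sample or known variance -> Normal (z)
    if sigma_known or n > large_n:
        return False
    # Small sample, unknown variance: t is only valid for a Normal population
    if not population_normal:
        raise ValueError("small sample, unknown variance, non-Normal population")
    return True

print(use_t(10, sigma_known=False, population_normal=True))   # True  (t)
print(use_t(100, sigma_known=False, population_normal=True))  # False (z)
print(use_t(10, sigma_known=True, population_normal=True))    # False (z)
```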
Factors affecting the width of your CI:
- Sample Size (\(n\)): Bigger samples = narrower (more precise) intervals.
- Confidence Level: Higher confidence (e.g., 99% vs 95%) = wider intervals. (If you want to be more certain you caught the "fish," you need a bigger net!)
- Population Variability (\(\sigma\)): More spread out data = wider intervals.
Key Takeaway: A 95% Confidence Interval means that if we took 100 different samples and built 100 intervals, we expect about 95 of them to actually contain the true population mean.
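Putting the pieces together, here is a sketch of a 95% t-interval for a small sample. The data are invented, and the critical value \(t_{0.025,\,9} \approx 2.262\) is read from t-tables (9 degrees of freedom):

```python
import math
import statistics

# Hypothetical small sample (n = 10), population variance unknown -> use t
data = [49.1, 50.2, 48.7, 51.0, 49.5, 50.8, 49.9, 50.3, 48.9, 50.6]
n = len(data)
xbar = statistics.fmean(data)
s = statistics.stdev(data)        # n - 1 divisor, as in Section 1
se = s / math.sqrt(n)

t_crit = 2.262                    # t-tables: 95%, 9 degrees of freedom
lower, upper = xbar - t_crit * se, xbar + t_crit * se
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```

Rerun this with a higher `t_crit` (for 99% confidence) or fewer data points and you will see the interval widen, exactly as the factors above predict.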
4. Paired Samples
Sometimes data comes in pairs. For example, testing a person's reaction time before and after drinking coffee. These aren't two independent groups; they are the same people measured twice.
To handle this, we calculate the difference for each pair. We then treat these differences as a single sample and use the same mean and CI methods we used before.
Step-by-step:
1. Subtract 'Before' from 'After' for each person.
2. Find the mean of those differences (\(\bar{d}\)).
3. Find the standard deviation of those differences (\(s_d\)).
4. Build your interval or test using these "difference" values.
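The four steps look like this in code. The reaction times are made up, and the critical value \(t_{0.025,\,5} \approx 2.571\) comes from t-tables (5 degrees of freedom):

```python
import math
import statistics

# Hypothetical reaction times (seconds) before and after coffee
before = [0.42, 0.50, 0.47, 0.55, 0.44, 0.51]
after  = [0.38, 0.47, 0.45, 0.49, 0.43, 0.46]

# Step 1: one difference per person ('After' minus 'Before')
d = [a - b for a, b in zip(after, before)]

# Steps 2-3: mean and standard deviation of the differences
d_bar = statistics.fmean(d)
s_d = statistics.stdev(d)

# Step 4: build the interval exactly as for a single sample
se_d = s_d / math.sqrt(len(d))
t_crit = 2.571                    # t-tables: 95%, 5 degrees of freedom
lower, upper = d_bar - t_crit * se_d, d_bar + t_crit * se_d
print(f"mean difference {d_bar:.3f}, 95% CI ({lower:.3f}, {upper:.3f})")
```

In this invented example the whole interval sits below zero, which would be evidence that reaction times genuinely dropped after coffee.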
5. Hypothesis Testing for Averages
This is where we test a claim about a population parameter (usually the mean \(\mu\) or median).
The Three Main Tests:
1. Normal (z) Test: Used for means when \(n\) is large or \(\sigma\) is known.
2. t-test: Used for means when \(n\) is small, \(\sigma\) is unknown, and the population is Normal.
3. Wilcoxon Single Sample Signed-Rank Test: This is a non-parametric test. We use this to test the median. It is great when you don't want to assume the data follows a Normal distribution, but the distribution must be symmetrical.
Memory Aid for Wilcoxon: Remember W-S-S (Wilcoxon - Symmetrical - Signed-rank). It cares about the rank of the differences, not just the raw numbers.
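To see what "rank of the differences" means, here is a minimal sketch of the Wilcoxon test statistic \(W\) for \(H_0\): median \(= m_0\). For simplicity it assumes no zero differences and no tied absolute differences (real implementations handle both); the data are invented:

```python
def wilcoxon_w(data, m0):
    """Single-sample signed-rank statistic W (no ties or zeros assumed)."""
    d = [x - m0 for x in data if x != m0]   # differences from the claimed median
    ranked = sorted(d, key=abs)             # rank by absolute size, smallest first
    w_plus = sum(rank for rank, diff in enumerate(ranked, start=1) if diff > 0)
    w_minus = sum(rank for rank, diff in enumerate(ranked, start=1) if diff < 0)
    return min(w_plus, w_minus)             # the test statistic W

# Hypothetical sample, testing H0: median = 10
sample = [11.2, 9.1, 12.5, 10.8, 8.4, 13.0, 10.3]
print(wilcoxon_w(sample, 10))  # 8
```

The statistic only uses the ranks and signs of the differences, which is exactly why it needs no Normality assumption; you would then compare \(W\) against Wilcoxon tables for the given sample size.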
6. Making Decisions using Confidence Intervals
You can use a Confidence Interval to perform a hypothesis test!
If you have a hypothesised value (let’s say someone claims the average weight is 50kg) and you build a 95% Confidence Interval that does not include 50kg, you can reject their claim at the 5% significance level.
Quick Review Box:
- Value inside CI \(\rightarrow\) Insufficient evidence to reject the claim.
- Value outside CI \(\rightarrow\) Evidence to reject the claim.
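The review box is a one-line rule in code (the interval and the claimed values here are made up):

```python
# Reject the claim at the 5% level iff the claimed value falls
# outside the 95% confidence interval.
def reject_claim(ci, claimed_value):
    lower, upper = ci
    return not (lower <= claimed_value <= upper)

ci = (49.3, 50.5)               # hypothetical 95% CI for mean weight (kg)
print(reject_claim(ci, 50.0))   # False: 50 is inside, cannot reject
print(reject_claim(ci, 52.0))   # True: 52 is outside, reject at the 5% level
```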
Common Mistakes to Avoid
- Dividing by the wrong thing: When calculating the Standard Error, always divide the standard deviation by \(\sqrt{n}\), not just \(n\).
- Using z instead of t: If the sample is small (like \(n=10\)) and you don't know the population variance, you must use the t-distribution.
- Forgetting symmetry: You cannot use the Wilcoxon test unless you state the assumption that the underlying distribution is symmetrical.
Key Takeaway for the Chapter: Inference is about moving from the "Known" (Sample) to the "Unknown" (Population). Whether you use a point estimate, a confidence interval, or a hypothesis test, you are always accounting for the fact that samples vary!