Welcome to the World of "Goodness of Fit"!
Ever wondered if a coin is actually fair, or if the number of goals scored in a football match really follows a predictable pattern? In this chapter, we learn how to use the Goodness of Fit test to see if our real-world data matches a specific mathematical model.
Don't worry if this seems a bit abstract at first! Think of it like trying on a new pair of jeans: the "Goodness of Fit" test simply tells us how well the "jeans" (our mathematical model) fit the "person" (our actual data). If they don't fit, we need a different model!
1. The Core Idea: The Chi-Squared (\(\chi^2\)) Test
To measure the "fit," we use the Chi-Squared (\(\chi^2\)) test. This test compares what we actually saw (the Observed frequencies, \(O\)) with what we would expect to see if our model were true (the Expected frequencies, \(E\)).
The Test Statistic Formula
We use this formula to calculate our test statistic:
\[ \chi^2 = \sum \frac{(O - E)^2}{E} \]
Breaking down the formula:
1. \(O - E\): Find the difference between what happened and what we expected.
2. \((O - E)^2\): Square that difference (this makes every term positive, so positive and negative differences don't cancel each other out!).
3. \(/ E\): Divide by the expected value to scale the difference.
4. \(\sum\): Add them all up for every category.
Quick Review:
If the total \(\chi^2\) value is small, the model is a good fit (the "jeans" fit!).
If the total \(\chi^2\) value is large, the model is a poor fit (the "jeans" are too tight or too loose!).
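As a minimal sketch of the formula in action, the short Python snippet below adds up \((O - E)^2 / E\) across categories. The function name `chi_squared_statistic` and the die data are made up for illustration (a die rolled 60 times, tested against a fair-die model); they are not taken from any exam question.

```python
def chi_squared_statistic(observed, expected):
    """Return the sum of (O - E)^2 / E over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical data: 60 rolls of a die, so a fair die expects 10 of each face.
observed = [8, 12, 9, 11, 6, 14]
expected = [10] * 6

stat = chi_squared_statistic(observed, expected)
print(round(stat, 2))  # 4.2
```

Here the squared differences are 4, 4, 1, 1, 16 and 16, which sum to 42, and dividing each by \(E = 10\) gives a total of 4.2.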
2. The "Golden Rule": Pooling Data
There is one very important rule you must remember for a Chi-Squared test to be valid: Every Expected Frequency (\(E\)) must be at least 5.
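Why? The critical values in the \(\chi^2\) tables are only a good approximation when the expected frequencies are reasonably large. We also divide by \(E\) in our formula, so if \(E\) is a very small number (like 0.5), even a minor difference can make the \(\chi^2\) value explode. This ruins the test!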
What if \(E < 5\)?
If an expected frequency is less than 5, you must pool (combine) that category with an adjacent one.
Example: If you are looking at the number of goals scored (0, 1, 2, 3, 4+) and the "Expected" value for 4+ goals is only 2, you would combine the "3 goals" and "4+ goals" categories into one single "3 or more goals" category.
Key Takeaway: Always check your \(E\) values before you start the main calculation. If they are under 5, pool them!
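The pooling check can be sketched in Python as below. This is a simplified illustration that only pools at the two ends of the table (where small expected frequencies usually appear); the helper name `pool_tails` and the observed and expected values are hypothetical, chosen to mirror the goals example above.

```python
def pool_tails(observed, expected, minimum=5):
    """Merge end categories into their neighbour while the end E is below minimum."""
    obs, exp = list(observed), list(expected)
    # Pool from the upper end (e.g. "4+" into "3").
    while len(exp) > 1 and exp[-1] < minimum:
        last_e, last_o = exp.pop(), obs.pop()
        exp[-1] += last_e
        obs[-1] += last_o
    # Pool from the lower end if needed.
    while len(exp) > 1 and exp[0] < minimum:
        first_e, first_o = exp.pop(0), obs.pop(0)
        exp[0] += first_e
        obs[0] += first_o
    return obs, exp


# Hypothetical frequencies for 0, 1, 2, 3, 4+ goals (E for "4+" is only 2).
observed = [14, 18, 10, 6, 2]
expected = [13.0, 18.0, 12.0, 5.0, 2.0]

pooled_obs, pooled_exp = pool_tails(observed, expected)
print(pooled_obs)  # [14, 18, 10, 8]
print(pooled_exp)  # [13.0, 18.0, 12.0, 7.0]
```

Notice that both \(O\) and \(E\) are combined, and the table now has 4 categories instead of 5.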
3. Degrees of Freedom (\(v\))
In Statistics, "Degrees of Freedom" (represented by the Greek letter nu, \(\nu\), which is written as \(v\) in these notes) tells us how much "wiggle room" our data has. To find the critical value from your formula booklet, you must calculate this correctly.
The general formula for Goodness of Fit is:
\(v = n - 1 - k\)
Where:
- \(n\): The number of categories (after pooling).
- \(1\): We always subtract 1 because the total frequency is fixed.
- \(k\): The number of parameters we had to estimate from the sample data to calculate the expected frequencies.
How many parameters (\(k\)) do I subtract?
This is where students often get tripped up. Here is a handy guide:
- Uniform/Specified Distribution: \(k = 0\) (usually no parameters are estimated).
- Binomial (\(B(n, p)\)): \(k = 1\) (if we have to calculate \(p\) from the data). Careful: the \(n\) inside \(B(n, p)\) is the number of trials, not the number of categories.
- Poisson (\(Po(\lambda)\)): \(k = 1\) (if we have to calculate the mean \(\lambda\) from the data).
- Exponential (\(Exp(\lambda)\)): \(k = 1\) (if we have to calculate \(\lambda\) from the data).
- Normal (\(N(\mu, \sigma^2)\)): \(k = 2\) (if we have to calculate both the mean \(\mu\) and variance \(\sigma^2\) from the data).
Did you know? If the question gives you the parameters (e.g., "Test if the data fits a Poisson distribution with mean 3"), then \(k = 0\) because you didn't have to estimate them yourself!
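The guide above can be written as a small lookup, sketched here in Python. The dictionary and function names are hypothetical, and the sketch assumes that either every parameter of the distribution was estimated from the data or every parameter was given in the question.

```python
# Number of parameters estimated from the data, following the guide above.
ESTIMATED_PARAMETERS = {
    "uniform": 0,
    "binomial": 1,
    "poisson": 1,
    "exponential": 1,
    "normal": 2,
}


def degrees_of_freedom(n_categories, distribution, parameters_given=False):
    """Return v = n - 1 - k, where n is the number of categories after pooling."""
    k = 0 if parameters_given else ESTIMATED_PARAMETERS[distribution]
    return n_categories - 1 - k


print(degrees_of_freedom(6, "poisson"))                         # 4
print(degrees_of_freedom(6, "poisson", parameters_given=True))  # 5
print(degrees_of_freedom(5, "normal"))                          # 2
```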
4. Step-by-Step: Conducting the Test
Follow these steps to ensure you don't miss any marks:
- State your Hypotheses:
\(H_0\): The [Distribution] is a good fit for the data.
\(H_1\): The [Distribution] is not a good fit for the data.
- Calculate Expected Frequencies (\(E\)): Multiply the total frequency by the probability of each category occurring in that distribution.
- Check for Pooling: If any \(E < 5\), combine those categories.
- Calculate the Test Statistic: Use \(\sum \frac{(O - E)^2}{E}\).
- Determine Degrees of Freedom: \(v = n - 1 - k\) (remember \(n\) is the number of cells after pooling).
- Find the Critical Value: Look this up in your \(\chi^2\) table using \(v\) and the significance level (usually 5%).
- Make a Decision:
  - If your calculated \(\chi^2\) is greater than the critical value, Reject \(H_0\).
  - If your calculated \(\chi^2\) is less than the critical value, Do not reject \(H_0\).
- Conclusion in Context: "There is [significant/insufficient] evidence to suggest that the [Distribution] is not a good fit..." (use "significant" if you rejected \(H_0\) and "insufficient" if you did not).
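The whole procedure can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function name `goodness_of_fit_decision` and the spinner data are hypothetical, the test is fixed at the 5% significance level, and the small dictionary holds the standard 5% critical values of \(\chi^2\) for 1 to 6 degrees of freedom, as you would read them from a table.

```python
# Standard 5% critical values of chi-squared for v = 1 to 6.
CRITICAL_5_PERCENT = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070, 6: 12.592}


def goodness_of_fit_decision(observed, expected, k=0):
    """Run the test at the 5% level; k is the number of estimated parameters."""
    if any(e < 5 for e in expected):
        raise ValueError("pool categories first: every E must be at least 5")
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    v = len(expected) - 1 - k          # categories counted after pooling
    critical = CRITICAL_5_PERCENT[v]
    reject_h0 = statistic > critical
    return statistic, v, critical, reject_h0


# Hypothetical data: a spinner with 4 equal sections, spun 80 times.
observed = [28, 14, 22, 16]
expected = [20, 20, 20, 20]

statistic, v, critical, reject_h0 = goodness_of_fit_decision(observed, expected)
print(round(statistic, 2), v, critical, reject_h0)  # 6.0 3 7.815 False
```

Since 6.0 is less than 7.815, we do not reject \(H_0\): there is insufficient evidence to suggest that the uniform model is not a good fit for this made-up spinner.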
5. Common Pitfalls to Avoid
- Forgetting to Pool: This is the most common mistake. Always check the \(E\) values first!
- Using \(O\) in the Denominator: The formula is divided by \(E\), not \(O\). Remember: "(O minus E), all squared, over E".
- Incorrect \(n\): Using the number of categories before pooling instead of after pooling when calculating degrees of freedom.
- Yates' Correction: You might see this in older textbooks or other courses. For Pearson Edexcel 9ST0, Yates' correction is NOT required. Ignore it!
6. Summary Table for Parameters (\(k\))
If you're ever in doubt about what \(k\) to use, refer back to this "Quick Review" box:
Quick Review: Estimating Parameters
1. Did I calculate the mean from the raw data? (Poisson/Exponential/Binomial/Normal) \( \rightarrow \) Subtract 1.
2. Did I calculate the standard deviation from the raw data? (Normal only) \( \rightarrow \) Subtract another 1.
3. If the question gave me the values to use \( \rightarrow \) Subtract 0.
Example: You are testing if a set of 100 observations follows a Poisson distribution. You calculate the sample mean to be 2.4 to help find your expected values. You have 6 categories, and no pooling is needed.
Your degrees of freedom would be: \(v = 6 - 1 - 1 = 4\).
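To see where the expected values in an example like this come from, here is a minimal Python sketch. It assumes the 6 categories are 0, 1, 2, 3, 4 and "5 or more" (the example does not say which categories were used), and the helper name `poisson_expected` is hypothetical.

```python
import math


def poisson_expected(mean, total, max_single):
    """Expected frequencies for 0..max_single plus a final 'more than max_single' category."""
    probs = [
        math.exp(-mean) * mean ** x / math.factorial(x)
        for x in range(max_single + 1)
    ]
    probs.append(1 - sum(probs))  # remaining upper-tail probability
    return [total * p for p in probs]


# Assumed categories: 0, 1, 2, 3, 4 and 5+ (six in total).
expected = poisson_expected(mean=2.4, total=100, max_single=4)
print([round(e, 2) for e in expected])  # [9.07, 21.77, 26.13, 20.9, 12.54, 9.59]

# One parameter (the mean) was estimated from the sample, so k = 1.
v = len(expected) - 1 - 1
print(v)  # 4
```

Under these assumed categories every expected frequency is at least 5, which agrees with the statement that no pooling is needed.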
Final Encouragement: Goodness of Fit is a very structured topic. Once you master the step-by-step process and remember the "Rule of 5" for pooling, you'll find these questions are great opportunities to pick up marks in Paper 2!