Welcome to Goodness of Fit and Contingency Tables!

Ever wondered if a "random" dice is actually fair, or if your favorite snack brand really puts the same amount of chocolate in every bar? In this chapter of Unit S3, we learn how to use the Chi-squared (\(\chi^2\)) test to compare what we actually see in real life (Observed data) with what a mathematical model tells us should happen (Expected data). It’s all about seeing how well a model "fits" the reality!

1. The Big Idea: The \(\chi^2\) Test

The core of this chapter is the Goodness of Fit test. We use a specific formula to calculate a "test statistic." Think of this statistic as a score that tells you how much your real-world observations "miss" the target of your mathematical model.

The Formula

The formula for the \(\chi^2\) statistic is:
\( \chi^2 = \sum \frac{(O_i - E_i)^2}{E_i} \)

Where:
\(O_i\) is the Observed frequency (what actually happened).
\(E_i\) is the Expected frequency (what the model predicted).
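To see the formula in action, here is a minimal Python sketch. The function name `chi_squared_statistic` and the dice counts are made up for illustration; they are not from the syllabus.

```python
def chi_squared_statistic(observed, expected):
    """Return the sum of (O - E)^2 / E over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical data: 60 rolls of a dice, tested against a fair-dice model.
observed = [8, 12, 9, 11, 6, 14]
expected = [10, 10, 10, 10, 10, 10]  # 60 rolls / 6 faces

statistic = chi_squared_statistic(observed, expected)
print(round(statistic, 2))  # 4.2
```

Each category contributes its own squared gap divided by what was expected, so one badly "missed" category can dominate the total.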

How to Interpret the Score

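• If the \(\chi^2\) value is small, the Observed and Expected values are close, so there is no evidence against the model. The model is a Good Fit.
• If the \(\chi^2\) value is large, there is a big gap between reality and the model. The model is a Poor Fit.

"Small" and "large" are judged by comparing the statistic with a critical value from the \(\chi^2\) table (see Section 6).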

Quick Review Box

\(H_0\) (Null Hypothesis): The data fits the specified distribution (e.g., "The dice is fair").
\(H_1\) (Alternative Hypothesis): The data does not fit the specified distribution (e.g., "The dice is biased").

Key Takeaway: We are measuring the "distance" between what we saw and what we expected. If that distance is too big, we reject the model!

2. The "Rule of Five" and Combining Cells

Don't worry if the math seems a bit heavy; there is one very important "Golden Rule" in \(\chi^2\) testing: Expected frequencies (\(E_i\)) must be at least 5.

If an Expected frequency is less than 5, the \(\chi^2\) test becomes unreliable. To fix this, we combine adjacent cells until the new "combined" Expected frequency is 5 or more.

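Example: If you are testing a model for a biased dice and the Expected frequency for rolling a '6' is only 3, you might combine the '5' and '6' categories into one "5 or 6" category. Add the Observed frequencies together and add the Expected frequencies together, then check that the new Expected frequency is at least 5.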

Common Mistake: Students often look at the Observed (\(O\)) values to see if they are less than 5. Stop! Only check the Expected (\(E\)) values for this rule.
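The combining step can be sketched in Python as below. This is one possible approach (working left to right and folding a too-small tail into its neighbour), not an official method; in an exam you would usually just combine the small cells at the ends by eye. The function name and the data are hypothetical.

```python
def combine_small_cells(observed, expected, minimum=5):
    """Merge adjacent cells until every Expected frequency is >= minimum.

    Only the Expected values decide when to merge; Observed values are
    simply added along with them.
    """
    merged_o, merged_e = [], []
    run_o, run_e = 0, 0.0
    for o, e in zip(observed, expected):
        run_o += o
        run_e += e
        if run_e >= minimum:
            merged_o.append(run_o)
            merged_e.append(run_e)
            run_o, run_e = 0, 0.0
    if run_e > 0:  # leftover tail that never reached the minimum
        if merged_e:
            merged_o[-1] += run_o
            merged_e[-1] += run_e
        else:
            merged_o.append(run_o)
            merged_e.append(run_e)
    return merged_o, merged_e


# Hypothetical frequencies for the values 0, 1, 2, 3, 4, 5.
observed = [12, 20, 15, 8, 3, 2]
expected = [10.5, 21.0, 16.5, 7.5, 3.0, 1.5]

new_o, new_e = combine_small_cells(observed, expected)
print(new_o)  # [12, 20, 15, 13]
print(new_e)  # [10.5, 21.0, 16.5, 12.0]
```

Here the cells for 4 and 5 only reach 4.5 together, so they are folded into the cell for 3, giving a single "3 or more" cell.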

3. Calculating Expected Frequencies

Depending on the distribution you are testing, you calculate \(E_i\) differently. The syllabus focuses on these main types:

A. Discrete Uniform Distribution

This is the easiest! If everything is equally likely (like a fair dice), then:
\( E_i = \frac{\text{Total Frequency}}{\text{Number of Categories}} \)

B. Binomial and Poisson Distributions

You use the probability formulas you learned in S2:
\( E_i = P(X = i) \times \text{Total Frequency} \)

Important Note: If you have to estimate parameters (like the mean \(\lambda\) for Poisson or probability \(p\) for Binomial) from the data because they aren't given, it will affect your "Degrees of Freedom" later!
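As an illustration of the Poisson case, here is a rough Python sketch that estimates \(\lambda\) from the data and then builds the Expected frequencies. The function name and data are hypothetical. It assumes the final cell means "that value or more", and it treats that cell as the exact value when estimating the mean, which is a simplification.

```python
import math


def poisson_expected(observed_counts):
    """observed_counts[i] is how many observations equalled i.

    The last cell is treated as 'that value or more', so its probability
    is whatever is left after the earlier cells.
    """
    total = sum(observed_counts)
    # Estimate lambda by the sample mean (this costs one degree of freedom).
    mean = sum(i * f for i, f in enumerate(observed_counts)) / total
    last = len(observed_counts) - 1
    probs = [math.exp(-mean) * mean ** i / math.factorial(i) for i in range(last)]
    probs.append(1 - sum(probs))  # tail probability for the final cell
    return mean, [p * total for p in probs]


# Hypothetical data: number of events per interval over 100 intervals.
observed = [30, 40, 20, 10]  # 0, 1, 2, "3 or more"
mean, expected = poisson_expected(observed)
print(round(mean, 2))
print([round(e, 2) for e in expected])
```

Using a "3 or more" tail cell makes the Expected frequencies add up to the total frequency, which is why 1 is subtracted in the degrees of freedom.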

C. Normal and Continuous Uniform Distributions

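For continuous data, you calculate the probability of falling into a range (class interval) and multiply by the total frequency. For a Normal model, let the first class run down to \(-\infty\) and the last class run up to \(+\infty\) so that the probabilities add up to 1.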
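A minimal Python sketch of the Normal case, using `NormalDist` from the standard library. The function name, class boundaries, and parameter values are assumptions chosen for illustration. The first and last classes are left open-ended.

```python
from statistics import NormalDist


def normal_expected(boundaries, total, mu, sigma):
    """Expected frequencies for classes split at the given boundaries.

    The first class is everything below boundaries[0] and the last class
    is everything above boundaries[-1].
    """
    dist = NormalDist(mu, sigma)
    cdf = [0.0] + [dist.cdf(b) for b in boundaries] + [1.0]
    # Probability of each class = difference of neighbouring cdf values.
    return [(hi - lo) * total for lo, hi in zip(cdf, cdf[1:])]


# Hypothetical model: heights ~ N(170, 10^2), 200 people,
# classes: under 160, 160-170, 170-180, over 180.
expected = normal_expected([160, 170, 180], total=200, mu=170, sigma=10)
print([round(e, 2) for e in expected])
```

In an exam you would read the same probabilities from the Normal table after standardising each class boundary.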

Did you know? We use the \(\chi^2\) test in medicine to see if a new drug's side effects match what was predicted in clinical trials!

4. Degrees of Freedom (\(v\))

The Degrees of Freedom (represented by the Greek letter \(\nu\) or just \(v\)) tell us which \(\chi^2\) distribution curve to use from our formula booklet.

The General Rule:

\( v = n - 1 - c \)

Where:
\(n\) is the number of cells after any combining.
\(1\) is always subtracted because the total frequencies must match.
\(c\) is the number of estimated parameters.

How many parameters (\(c\)) to subtract?

Discrete Uniform: \(c = 0\) (No parameters to estimate).
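Poisson: \(c = 1\) if you have to estimate \(\lambda\) from the sample mean (\(c = 0\) if \(\lambda\) is given).
Binomial: \(c = 1\) if you have to estimate \(p\) from the data (\(c = 0\) if \(p\) is given).
Normal: \(c = 2\) if you have to estimate both \(\mu\) and \(\sigma^2\) (\(c = 1\) if only one of them is estimated, \(c = 0\) if both are given).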

Key Takeaway: Always count your cells after you have combined them for the "Rule of Five," then subtract 1, then subtract any parameters you calculated yourself.
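The rule is simple enough to write as a one-line calculation. The sketch below uses a hypothetical helper name and made-up cell counts to show the three typical cases.

```python
def degrees_of_freedom(cells_after_combining, estimated_parameters=0):
    """v = n - 1 - c, where n is counted AFTER combining cells."""
    v = cells_after_combining - 1 - estimated_parameters
    if v < 1:
        raise ValueError("not enough cells left to carry out the test")
    return v


print(degrees_of_freedom(6))     # fair dice, 6 cells, nothing estimated: 5
print(degrees_of_freedom(4, 1))  # Poisson, 4 cells, lambda estimated: 2
print(degrees_of_freedom(6, 2))  # Normal, 6 cells, mu and sigma^2 estimated: 3
```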

5. Contingency Tables

Sometimes we want to know if two variables are independent. For example, "Is your favorite color independent of your gender?" We use a Contingency Table for this.

Step-by-Step: Finding Expected Frequencies

In a contingency table, you calculate the Expected value for each cell using the "Row-Column-Total" rule:
\( E = \frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}} \)

Degrees of Freedom for Tables

For a table with \(r\) rows and \(c\) columns (here \(c\) counts columns, not estimated parameters as in Section 4):
\( v = (r - 1) \times (c - 1) \)
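Here is a small Python sketch of the contingency-table calculations. The function name and the 2 by 3 table are invented for illustration.

```python
def expected_table(observed):
    """Apply E = (row total x column total) / grand total to every cell."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


# Hypothetical 2 x 3 table of Observed frequencies.
observed = [
    [30, 20, 10],
    [20, 30, 40],
]

expected = expected_table(observed)
print(expected)  # [[20.0, 20.0, 20.0], [30.0, 30.0, 30.0]]

# Same test statistic as before, summed over every cell of the table.
statistic = sum(
    (o - e) ** 2 / e
    for o_row, e_row in zip(observed, expected)
    for o, e in zip(o_row, e_row)
)
rows, cols = len(observed), len(observed[0])
v = (rows - 1) * (cols - 1)
print(round(statistic, 3), v)
```

Note that the Expected table has the same row and column totals as the Observed table, which is a quick way to check your working.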

Quick Review: Contingency Table Hypotheses

\(H_0\): There is no association between the two variables (they are independent).
\(H_1\): There is an association between the two variables (they are dependent).

Don't worry if this seems tricky! Just remember:
1. Calculate totals for every row and column.
2. Use the Row-Column formula for each cell.
3. Use the \(\sum \frac{(O-E)^2}{E}\) formula just like before!

6. Summary of the Testing Process

To succeed in your exam, follow these steps every time:

  1. State your Null (\(H_0\)) and Alternative (\(H_1\)) hypotheses clearly.
  2. Calculate the Expected frequencies (\(E\)).
  3. Check the Rule of Five: Combine cells if any \(E < 5\).
  4. Calculate your test statistic using \( \sum \frac{(O-E)^2}{E} \).
  5. Determine the Degrees of Freedom (\(v\)).
  6. Find the critical value from the \(\chi^2\) table in your formula book using the given significance level (e.g., 5%).
  7. Compare and Conclude: If your calculated value is greater than the critical value, reject \(H_0\). Otherwise, do not reject \(H_0\). State your conclusion in the context of the question.
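The steps above can be sketched end to end for the simplest case, a discrete uniform model tested at the 5% level. This is only an illustration: the function name and data are hypothetical, and the critical values are the standard upper 5% points of the \(\chi^2\) distribution typed in by hand for \(v = 1\) to \(6\).

```python
# Upper 5% critical values of the chi-squared distribution for v = 1..6.
CRITICAL_5_PERCENT = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070, 6: 12.592}


def uniform_goodness_of_fit(observed):
    """Test H0: every category is equally likely, at the 5% level."""
    total = sum(observed)
    n = len(observed)
    expected = [total / n] * n              # step 2
    if min(expected) < 5:                   # step 3
        raise ValueError("combine cells first: an Expected frequency is below 5")
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))  # step 4
    v = n - 1                               # step 5: no parameters estimated
    critical = CRITICAL_5_PERCENT[v]        # step 6
    reject_h0 = statistic > critical        # step 7
    return statistic, v, critical, reject_h0


# Hypothetical dice data, 60 rolls each.
print(uniform_goodness_of_fit([8, 12, 9, 11, 6, 14]))
print(uniform_goodness_of_fit([2, 5, 8, 10, 15, 20]))
```

For the first set of rolls the statistic is well below 11.070, so \(H_0\) is not rejected; for the second it is well above, so \(H_0\) is rejected.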

Final Encouragement: You've got this! Just take it one step at a time. The most common errors are forgetting to combine cells or getting the degrees of freedom wrong. Double-check those two things, and you'll be a \(\chi^2\) master!