Introduction to Contingency Tables

Welcome! In this chapter, we are going to explore how statisticians figure out if there is a relationship between two different categorical variables. For example, is there a link between the sport someone plays and the brand of shoes they prefer? Or does the town you live in affect which political party you vote for?

In Paper 2, you are focusing on Statistical Inference. This means we use data from a sample to make an inference (a reasoned "best guess") about the whole population. Contingency tables are a powerful tool for this because they allow us to test for independence: checking whether the category someone falls into for one variable makes no difference to the category they fall into for the other.

Don't worry if this seems a bit abstract right now; we’ll break it down into simple steps that you can follow every time!

1. What is a Contingency Table?

A contingency table (sometimes called a two-way table) is simply a grid used to summarize the relationship between two categorical variables. One variable is shown in the rows, and the other is shown in the columns.

Example: A survey asks 100 students if they prefer Tea or Coffee and whether they are in Year 12 or Year 13.

Observed Frequencies (O): These are the actual numbers you collect from your research. You might see a table like this:

| | Tea | Coffee | Total |
|---|---|---|---|
| Year 12 | 20 | 30 | 50 |
| Year 13 | 25 | 25 | 50 |
| Total | 45 | 55 | 100 (Grand Total, N) |
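
To make the layout concrete, here is a minimal Python sketch that stores the Tea/Coffee counts above as a grid and recomputes the totals. The variable names (`observed`, `row_totals`, and so on) are chosen purely for illustration.

```python
# Observed frequencies from the Tea/Coffee survey.
# Rows: Year 12, Year 13. Columns: Tea, Coffee.
# The totals are NOT stored in the grid; they are derived from it.
observed = [
    [20, 30],  # Year 12
    [25, 25],  # Year 13
]

row_totals = [sum(row) for row in observed]           # [50, 50]
column_totals = [sum(col) for col in zip(*observed)]  # [45, 55]
grand_total = sum(row_totals)                         # 100

print(row_totals, column_totals, grand_total)
```

Keeping the totals out of the grid means the grid has exactly one entry per cell, which matters later when counting rows and columns.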

Key Takeaway

A contingency table organizes raw counts of data into rows and columns so we can look for patterns between two categories.

2. The Chi-Squared (\(\chi^2\)) Test for Independence

To decide if the two variables are actually linked, we perform a Chi-Squared (\(\chi^2\)) test. We are testing the Null Hypothesis (\(H_0\)).

The Hypotheses:
\(H_0\): The two variables are independent (there is no association).
\(H_1\): The two variables are not independent (there is an association/link).

Step-by-Step: Finding the Expected Frequencies (E)

To see if the variables are independent, we calculate what the table would look like if there were absolutely no link between them. We call these Expected Frequencies.

The formula for each cell in the table is:
\(E = \frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}}\)

Quick Review:
O = Observed (The real data you have)
E = Expected (The "perfectly independent" data you calculate)
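
As a rough illustration, the formula can be applied to every cell of the Tea/Coffee survey (Year 12: 20 Tea, 30 Coffee; Year 13: 25 Tea, 25 Coffee) with a short Python function. The function name `expected_frequencies` is an assumption made for this sketch.

```python
def expected_frequencies(observed):
    """Return the grid of E values: row total x column total / grand total."""
    row_totals = [sum(row) for row in observed]
    column_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [
        [rt * ct / grand_total for ct in column_totals]
        for rt in row_totals
    ]

# Rows: Year 12, Year 13. Columns: Tea, Coffee (no totals in the grid).
observed = [[20, 30], [25, 25]]
expected = expected_frequencies(observed)
print(expected)  # [[22.5, 27.5], [22.5, 27.5]]
```

Notice that expected frequencies do not have to be whole numbers, and that each row and column of E adds up to the same total as the matching row or column of O.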

Step-by-Step: The Test Statistic

Once you have your O and E values for every cell, you calculate the \(\chi^2\) test statistic using this formula:
\(\chi^2 = \sum \frac{(O - E)^2}{E}\)

Think of it like this: We are measuring the "gap" between reality (O) and the independent model (E). The bigger the \(\chi^2\) value, the stronger the evidence that the variables are not independent.
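
The sum can be sketched in Python for the Tea/Coffee survey. Here the expected values (22.5 for each Tea cell and 27.5 for each Coffee cell) come from applying \(E = \frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}}\) to the observed table; the function name `chi_squared` is an assumption made for this sketch.

```python
def chi_squared(observed, expected):
    """Sum (O - E)^2 / E over every cell of the table."""
    return sum(
        (o - e) ** 2 / e
        for o_row, e_row in zip(observed, expected)
        for o, e in zip(o_row, e_row)
    )

# Rows: Year 12, Year 13. Columns: Tea, Coffee.
observed = [[20, 30], [25, 25]]
expected = [[22.5, 27.5], [22.5, 27.5]]

statistic = chi_squared(observed, expected)
print(round(statistic, 4))  # 1.0101
```

Each cell contributes a non-negative amount, so the statistic can only be zero when every observed count matches its expected count exactly.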

Key Takeaway

The \(\chi^2\) test compares what we saw to what we would expect to see if the two variables had nothing to do with each other.

3. Degrees of Freedom (\(df\))

To find the critical value in your statistical tables, you need to know the Degrees of Freedom. This is the number of cells you could fill in freely before the row and column totals force the remaining values.

For a contingency table, the formula is simple:
\(df = (\text{Number of Rows} - 1) \times (\text{Number of Columns} - 1)\)

Common Mistake to Avoid: Do not include the "Total" row or column when counting your rows and columns for this formula!

Example: In a \(3 \times 2\) table (3 rows, 2 columns):
\(df = (3 - 1) \times (2 - 1) = 2 \times 1 = 2\)
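
If you store a table as a grid without its totals, the formula can be sketched in Python as below. The function name `degrees_of_freedom` and the counts in the \(3 \times 2\) grid are invented purely for illustration.

```python
def degrees_of_freedom(observed):
    """(rows - 1) x (columns - 1), where the grid excludes any totals."""
    rows = len(observed)
    columns = len(observed[0])
    return (rows - 1) * (columns - 1)

# A 3 x 2 grid of made-up counts (3 rows, 2 columns, no totals).
three_by_two = [[12, 8], [15, 5], [9, 11]]
print(degrees_of_freedom(three_by_two))  # 2

# The 2 x 2 Tea/Coffee table has (2 - 1) x (2 - 1) = 1 degree of freedom.
print(degrees_of_freedom([[20, 30], [25, 25]]))  # 1
```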

4. Important Rules: The "Rule of 5" and Pooling

The \(\chi^2\) test is an approximation. For it to be accurate, the Expected Frequencies (E) must be large enough. The Pearson Edexcel syllabus requires every Expected Frequency to be 5 or greater.

What if an Expected Frequency is less than 5?

If any \(E\) value is smaller than 5, you must pool (combine) rows or columns. This means you merge two similar categories into one larger group by adding their observed frequencies together, and then recalculate the Expected Frequencies for the new table.

Example: If you are testing "Ice Cream Flavors" and the expected frequency for "Mint" is 3, you might combine the "Mint" and "Chocolate" categories into one called "Mint & Chocolate" so that the expected frequency is at least 5.
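
Pooling can be sketched in Python as follows. The ice cream counts below are invented for illustration (they are not real survey data), and the helper names `expected_frequencies` and `pool_columns` are assumptions made for this sketch.

```python
def expected_frequencies(observed):
    """E = row total x column total / grand total, for every cell."""
    row_totals = [sum(row) for row in observed]
    column_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[rt * ct / grand_total for ct in column_totals] for rt in row_totals]

def pool_columns(observed, keep, merge):
    """Add the observed counts in column `merge` to column `keep`, then drop `merge`."""
    pooled = []
    for row in observed:
        new_row = list(row)
        new_row[keep] += new_row[merge]
        del new_row[merge]
        pooled.append(new_row)
    return pooled

# Made-up counts. Rows: Year 12, Year 13. Columns: Vanilla, Chocolate, Mint.
observed = [[20, 14, 2], [18, 12, 4]]
smallest_before = min(min(row) for row in expected_frequencies(observed))
print(smallest_before < 5)  # True: the Mint column fails the Rule of 5

# Pool Mint (index 2) into Chocolate (index 1), then recalculate E.
pooled = pool_columns(observed, keep=1, merge=2)
smallest_after = min(min(row) for row in expected_frequencies(pooled))
print(pooled, smallest_after >= 5)  # [[20, 16], [18, 16]] True
```

The observed counts are pooled first and the expected frequencies are recalculated afterwards. The pooled table here has 2 columns instead of 3, so its degrees of freedom drop from 2 to 1.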

Did you know?
In your exam, you do not need to use Yates' Correction. Even if you see it in older textbooks or online, you can ignore it for the 9ST0 specification! Just stick to the standard \(\chi^2\) formula.

Key Takeaway

Always check your Expected Frequencies first. If any are \( < 5 \), you must combine (pool) categories until all values are 5 or more.

5. Interpreting the Results

After calculating your \(\chi^2\) test statistic and finding your critical value (using your \(df\) and the significance level, usually 5%):

1. If your Calculated \(\chi^2\) > Critical Value: Reject \(H_0\). There is evidence of a link.
2. If your Calculated \(\chi^2\) < Critical Value: Do not reject \(H_0\). There is insufficient evidence of a link.

Remember: Always write your final conclusion in the context of the question! Don't just say "Reject \(H_0\)," say "There is evidence to suggest an association between [Variable A] and [Variable B]."
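
The decision rule can be sketched in Python using the Tea/Coffee survey, whose test statistic works out to roughly 1.0101 with 1 degree of freedom. The dictionary holds standard 5% critical values from \(\chi^2\) tables; the function name `conclude` is an assumption made for this sketch, and treating a statistic exactly equal to the critical value as "do not reject" is also an assumption, since the rule above only covers the greater-than and less-than cases.

```python
# Upper-tail 5% critical values of the chi-squared distribution, keyed by df.
CRITICAL_5_PERCENT = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}

def conclude(statistic, df, variable_a, variable_b):
    """Compare the statistic with the 5% critical value and conclude in context."""
    critical = CRITICAL_5_PERCENT[df]
    if statistic > critical:
        return (f"Reject H0: there is evidence, at the 5% level, of an "
                f"association between {variable_a} and {variable_b}.")
    # Assumption: equality is treated the same as "less than".
    return (f"Do not reject H0: there is insufficient evidence, at the 5% level, "
            f"of an association between {variable_a} and {variable_b}.")

result = conclude(1.0101, 1, "year group", "drink preference")
print(result)
```

Because 1.0101 is well below 3.841, this sample gives insufficient evidence of a link between year group and drink preference.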

Summary Checklist for Exams

- [ ] State your hypotheses clearly (\(H_0\) is always "Independent").
- [ ] Construct the table and calculate Expected Frequencies using \(\frac{RT \times CT}{GT}\).
- [ ] Check for the Rule of 5: Combine (pool) categories if any \(E < 5\).
- [ ] Calculate the \(\chi^2\) test statistic using \(\sum \frac{(O - E)^2}{E}\).
- [ ] Determine the Degrees of Freedom: \((r-1)(c-1)\).
- [ ] Compare your result to the Critical Value and conclude in context.

Top Tip: If you have to pool rows or columns, remember that your Degrees of Freedom will change because the number of rows or columns has decreased!