Welcome to Chi-Squared (\(\chi^2\)) Tests for Association!

Have you ever wondered if there is a genuine link between two things? For example, does your choice of favorite sport depend on your age group? Or is the type of music people listen to linked to where they live? In statistics, we don't just "guess" if there is a connection; we use a mathematical tool called the Chi-Squared Test for Association to weigh the evidence for it!

In this chapter, you will learn how to take raw data, organize it, and calculate a "score" to decide if two variables are independent or if there is a significant "association" (a fancy word for a link) between them. Don't worry if this seems tricky at first—we will take it one step at a time!


1. Setting the Stage: Contingency Tables

Before we can do any math, we need to organize our data. We use an \(n \times m\) contingency table. This is simply a grid where the rows represent the categories of one variable and the columns represent the categories of the other.

Example: Imagine we ask 100 students if they prefer tea or coffee. We also record if they are in Year 12 or Year 13.

The Observed Frequencies (\(O_i\)) are the actual counts you collected from your survey. In this example, the table is laid out as follows:
• Row 1: Year 12 students
• Row 2: Year 13 students
• Column 1: Tea drinkers
• Column 2: Coffee drinkers

The "size" of the table is described as (rows) \(\times\) (columns). The example above is a \(2 \times 2\) table. If we added a "Year 11" row, it would become a \(3 \times 2\) table.

Quick Review:
Always calculate the Row Totals, Column Totals, and the Grand Total (the sum of everything) before you start. You'll need these for the next step!
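As a minimal sketch of this step, the snippet below computes the totals for the tea/coffee example. The counts are hypothetical, chosen only for illustration (the survey results themselves are not given here), and the variable names are assumptions.

```python
# Hypothetical survey counts (illustrative only, not real data).
# Rows: Year 12, Year 13. Columns: Tea, Coffee.
observed = [
    [40, 20],  # Year 12
    [15, 25],  # Year 13
]

# Row totals: add across each row.
row_totals = [sum(row) for row in observed]

# Column totals: zip(*observed) regroups the table by column.
col_totals = [sum(col) for col in zip(*observed)]

# Grand total: the sum of everything.
grand_total = sum(row_totals)

print(row_totals)   # [60, 40]
print(col_totals)   # [55, 45]
print(grand_total)  # 100
```

A quick check worth doing by hand as well: the row totals and the column totals should both add up to the grand total.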


2. The "What If" Scenario: Expected Frequencies

To see if there is a link, we first have to imagine what the data would look like if there were no link at all. We call these the Expected Frequencies (\(E_i\)).

The Golden Rule for Expected Frequencies:
For any cell in your table, calculate:
\(E = \frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}}\)
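The formula can be applied to every cell at once. The sketch below does this for a hypothetical tea/coffee table (the counts are made up for illustration, and `expected_frequencies` is a name chosen here, not a standard function).

```python
def expected_frequencies(observed):
    """Return the expected frequency for each cell, assuming no association."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    # E = (row total * column total) / grand total, cell by cell.
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


# Hypothetical counts. Rows: Year 12, Year 13. Columns: Tea, Coffee.
observed = [[40, 20], [15, 25]]
expected = expected_frequencies(observed)
print(expected)  # [[33.0, 27.0], [22.0, 18.0]]
```

For instance, the Year 12 tea cell is \(\frac{60 \times 55}{100} = 33\). Notice that the expected values keep the same row and column totals as the observed table.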

Important Convention (SE3):
For a Chi-Squared test to be reliable, every single Expected Frequency (\(E_i\)) must be greater than 5.
Why? The \(\chi^2\) distribution only approximates how the test statistic behaves, and that approximation becomes unreliable when the expected numbers are too small. If any \(E_i\) fails this condition in a real-world problem, you might need to combine rows or columns to make the groups bigger, then recalculate the expected frequencies for the new table!
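Here is a rough sketch of checking this convention and combining columns when it fails. The three-column table (with an extra "Neither" option) is hypothetical, and the helper names `all_expected_above_5` and `merge_columns` are assumptions made for this illustration.

```python
def expected_frequencies(observed):
    """Expected frequency for each cell: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def all_expected_above_5(observed):
    """True only if every expected frequency is greater than 5."""
    return all(e > 5 for row in expected_frequencies(observed) for e in row)


def merge_columns(observed, i, j):
    """Return a new table where column j has been added into column i."""
    merged = []
    for row in observed:
        new_row = list(row)
        new_row[i] += new_row[j]
        del new_row[j]
        merged.append(new_row)
    return merged


# Hypothetical counts. Columns: Tea, Coffee, Neither.
wide = [[40, 18, 2], [15, 22, 3]]
print(all_expected_above_5(wide))      # False (the "Neither" column is too small)

# Combine "Coffee" and "Neither" into a single "Not tea" column.
combined = merge_columns(wide, 1, 2)
print(combined)                        # [[40, 20], [15, 25]]
print(all_expected_above_5(combined))  # True
```

Only combine groups when the merged category still makes sense in context, and remember that the combined table has fewer columns (or rows) than the original.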

Key Takeaway: \(O\) is what we Observed in real life. \(E\) is what we Expected if the two variables had nothing to do with each other.


3. The Chi-Squared Statistic: Measuring the Gap

Now we calculate how different our real-life data (\(O\)) is from our "no-link" data (\(E\)). We use the Chi-Squared (\(\chi^2\)) formula:

\(\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}\)

Step-by-Step Guide to calculating \(\chi^2\):
1. For each cell in the table, subtract the Expected value from the Observed value: \((O - E)\).
2. Square that number: \((O - E)^2\). (This makes sure negative numbers don't cancel out the positive ones!)
3. Divide that result by the Expected value: \(\frac{(O - E)^2}{E}\).
4. Sum (\(\sum\)) all those values together. The final total is your test statistic.
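The four steps above can be sketched in a few lines. The observed counts are hypothetical, the expected values are the ones the row-total and column-total formula gives for that table, and `chi_squared` is a name chosen for this example.

```python
def chi_squared(observed, expected):
    """Sum (O - E)^2 / E over every cell of the table."""
    total = 0.0
    for obs_row, exp_row in zip(observed, expected):
        for o, e in zip(obs_row, exp_row):
            difference = o - e           # Step 1: O - E
            squared = difference ** 2    # Step 2: square it
            total += squared / e         # Steps 3 and 4: divide by E, add to the sum
    return total


# Hypothetical counts. Rows: Year 12, Year 13. Columns: Tea, Coffee.
observed = [[40, 20], [15, 25]]
expected = [[33.0, 27.0], [22.0, 18.0]]

statistic = chi_squared(observed, expected)
print(round(statistic, 3))  # 8.249
```

In this table every cell has \((O - E)^2 = 49\), so the cells with the smallest expected values contribute the most to the total.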

Memory Aid: Think of \(\chi^2\) as a "Difference Detector." If the Observed values are very close to the Expected values, \(\chi^2\) will be a small number (consistent with no association). If they are very different, \(\chi^2\) will be a large number (suggesting there probably is a link!).


4. Degrees of Freedom (\(v\))

Before we can look up our "score" in a statistical table, we need to know the Degrees of Freedom. This tells us how much "wiggle room" the data has: the number of cells you could fill in freely before the row and column totals fix all the rest.

For a contingency table with \(r\) rows and \(c\) columns:
\(v = (r - 1) \times (c - 1)\)

Example: In a \(3 \times 2\) table, the degrees of freedom would be \((3 - 1) \times (2 - 1) = 2 \times 1 = 2\).

Common Mistake: Don't include the "Total" rows or columns when counting \(r\) and \(c\)! Only count the rows and columns that contain the actual categories of data.
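As a small illustration, the function below counts rows and columns from a table that holds only the category counts (no totals). The tables and the name `degrees_of_freedom` are assumptions made for this sketch.

```python
def degrees_of_freedom(table):
    """(rows - 1) * (columns - 1), for a table WITHOUT total rows or columns."""
    rows = len(table)
    cols = len(table[0])
    return (rows - 1) * (cols - 1)


# Hypothetical 2 x 2 table (Year 12, Year 13 by Tea, Coffee).
print(degrees_of_freedom([[40, 20], [15, 25]]))            # 1

# Hypothetical 3 x 2 table (an extra Year 11 row added).
print(degrees_of_freedom([[10, 12], [40, 20], [15, 25]]))  # 2
```

If the totals had been left in, the \(2 \times 2\) table would be miscounted as \(3 \times 3\), giving 4 degrees of freedom instead of 1.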


5. Making a Decision: Hypotheses and Conclusions

Every statistical test starts with a pair of competing statements. We call these Hypotheses.

\(H_0\) (The Null Hypothesis): There is no association between the two variables. (They are independent).
\(H_1\) (The Alternative Hypothesis): There is an association between the two variables.

How to conclude:
1. Find the Critical Value from your \(\chi^2\) table using your degrees of freedom (\(v\)) and the significance level (usually 5%, which is the same as 0.05).
2. If your calculated \(\chi^2\) is GREATER than the Critical Value, you Reject \(H_0\). There is evidence of a link!
3. If your calculated \(\chi^2\) is LESS than the Critical Value, you Fail to Reject \(H_0\). There isn't enough evidence to say a link exists.
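The decision rule can be sketched as a simple comparison. The critical values below are the standard 5% entries from a \(\chi^2\) table for 1 to 4 degrees of freedom; the function name `decide` and the example statistics are hypothetical.

```python
# Standard chi-squared critical values at the 5% significance level,
# keyed by degrees of freedom (taken from a printed chi-squared table).
CRITICAL_VALUES_5_PERCENT = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}


def decide(chi2_value, df):
    """Compare the test statistic with the 5% critical value."""
    critical = CRITICAL_VALUES_5_PERCENT[df]
    if chi2_value > critical:
        return "reject H0"
    return "fail to reject H0"


# Hypothetical test statistics.
print(decide(8.249, 1))  # reject H0 (8.249 > 3.841)
print(decide(2.5, 2))    # fail to reject H0 (2.5 < 5.991)
```

Always finish by writing the conclusion in context, for example "there is evidence at the 5% level of an association between year group and drink preference."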


6. Identifying Sources of Association (SE4)

Sometimes, the test tells us there is a link, but it doesn't tell us where it is. To find the "Source of Association," we look back at our individual \(\frac{(O - E)^2}{E}\) values for each cell.

Look for the biggest number: The cell that contributed the most to the final \(\chi^2\) sum is the biggest source of the association. This is where the difference between "what we saw" and "what we expected" was the most extreme. Then compare \(O\) with \(E\) in that cell to see whether the count was higher or lower than expected.

Example Interpretation: "The main source of association was Year 13 students drinking much more coffee than expected." This adds context to your mathematical answer and is vital for gaining full marks in exam questions!
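A minimal sketch of this search is shown below. The counts and labels are hypothetical, picked so that the result lines up with the example interpretation above, and the expected values come from the usual row-total and column-total formula.

```python
# Hypothetical counts. Rows: Year 12, Year 13. Columns: Tea, Coffee.
observed = [[40, 20], [15, 25]]
expected = [[33.0, 27.0], [22.0, 18.0]]
row_labels = ["Year 12", "Year 13"]
col_labels = ["Tea", "Coffee"]

# Contribution of each cell to the chi-squared total: (O - E)^2 / E.
contributions = {}
for i, row_label in enumerate(row_labels):
    for j, col_label in enumerate(col_labels):
        o, e = observed[i][j], expected[i][j]
        contributions[(row_label, col_label)] = (o - e) ** 2 / e

# The cell with the largest contribution is the main source of association.
largest = max(contributions, key=contributions.get)
print(largest)                            # ('Year 13', 'Coffee')
print(round(contributions[largest], 3))   # 2.722

# Direction: was the observed count above or below the expected count?
i, j = row_labels.index(largest[0]), col_labels.index(largest[1])
direction = "more" if observed[i][j] > expected[i][j] else "fewer"
print(direction)                          # more
```

Here the largest contribution comes from Year 13 coffee drinkers, where 25 were observed against an expected 18.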


Quick Summary Checklist

Organize: Create the contingency table and calculate totals.
Hypothesize: State \(H_0\) (no link) and \(H_1\) (link exists).
Expect: Calculate \(E = \frac{\text{Row Total} \times \text{Col Total}}{\text{Grand Total}}\). Ensure all \(E > 5\).
Calculate: Find \(\chi^2\) using \(\sum \frac{(O - E)^2}{E}\).
Degrees of Freedom: Use \(v = (r-1)(c-1)\).
Compare: Check your value against the table's Critical Value.
Identify: If there's a link, find the cell with the largest contribution to the \(\chi^2\) score.

Don't worry if this seems like a lot of steps! With a bit of practice, working through the table becomes second nature. Remember: you're measuring how much "reality" differs from what you would expect if the two variables were independent.