Introduction to Chi-Squared Tests for Association

Hello there! Welcome to one of the most practical parts of your Statistics course. Have you ever wondered if there is a genuine link between two things—like whether your favorite music genre is related to your age group, or if people’s choice of breakfast depends on their job?

In this chapter, we are going to learn about the Chi-Squared (\(\chi^2\)) Test for Association. This test is a brilliant tool that helps us decide if two categorical variables are independent (have no connection) or if there is an association (they are linked) between them. Don't worry if this seems a bit heavy at first; we’ll break it down into simple, manageable steps!

Did you know? The "Chi" in Chi-Squared is a Greek letter pronounced like "Kai" (rhymes with "sky"), not "Chee"!


1. Setting the Scene: Contingency Tables

Before we can do any math, we need to organize our data. We use something called an \(n \times m\) contingency table.

Imagine we ask 100 students about their favorite sport and their year group. The table might look like this:

Example Table:
Year 12: Football (20), Tennis (10), Swimming (5)
Year 13: Football (15), Tennis (30), Swimming (20)

In this case, we have 2 rows (Year 12 and Year 13) and 3 columns (Football, Tennis, Swimming). We call this a \(2 \times 3\) table.
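If you like to check things on a computer, here is a minimal Python sketch of the sports table above. It stores the observed counts as a list of rows and works out the row totals, column totals and grand total. The variable names are just illustrative choices.

```python
# Observed counts from the sports survey above.
# Rows: Year 12, Year 13. Columns: Football, Tennis, Swimming.
observed = [
    [20, 10, 5],
    [15, 30, 20],
]

row_totals = [sum(row) for row in observed]          # one total per year group
col_totals = [sum(col) for col in zip(*observed)]    # one total per sport
grand_total = sum(row_totals)                        # everyone surveyed

n_rows, n_cols = len(observed), len(observed[0])

print(f"{n_rows} x {n_cols} table")
print("Row totals:   ", row_totals)     # [35, 65]
print("Column totals:", col_totals)     # [35, 40, 25]
print("Grand total:  ", grand_total)    # 100
```

The grand total of 100 matches the 100 students surveyed, which is a handy sanity check before going any further.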

Key Terms:

  • Observed Frequencies (\(O_i\)): These are the actual numbers we collected from our survey or experiment.
  • Expected Frequencies (\(E_i\)): These are the numbers we would expect to see if there were absolutely no connection between the variables.

Quick Review: An \(n \times m\) table simply means a table with \(n\) rows and \(m\) columns. Always count your rows and columns first!


2. The "What If" Phase: Calculating Expected Frequencies

To see if there is an association, we first calculate what the table would look like if the variables were independent. For every single cell in your table, you need to calculate an Expected Frequency (\(E_i\)) using this simple recipe:

\(E_i = \frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}}\)
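To see the recipe in action, here is a short Python sketch that applies it to every cell of the sports table from Section 1. The function name `expected_frequencies` is an illustrative choice, not part of any library.

```python
def expected_frequencies(observed):
    """Apply Row Total x Column Total / Grand Total to every cell."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


# Rows: Year 12, Year 13. Columns: Football, Tennis, Swimming.
observed = [
    [20, 10, 5],
    [15, 30, 20],
]

expected = expected_frequencies(observed)
for row in expected:
    print(row)
# [12.25, 14.0, 8.75]
# [22.75, 26.0, 16.25]
```

For example, the Year 12 Football cell is \(\frac{35 \times 35}{100} = 12.25\). Notice that the expected values in each row still add up to the row total, so nothing has been lost or gained.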

A Vital Rule to Remember!

For a Chi-Squared test to be valid, all Expected Frequencies (\(E_i\)) must be greater than 5.
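If you calculate an \(E_i\) that is 5 or less, you usually have to combine rows or columns (amalgamate them) until all the expected values are safely above 5. Only combine categories that sensibly belong together, and remember that the merged table has fewer rows or columns, so use its new dimensions when you work out the degrees of freedom.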

Analogy: Think of this like a party. If a room has 5 people or fewer, it's too empty for a good dance (the test won't work), so you merge it with the room next door!
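Here is a rough Python sketch of the merging rule. The survey numbers are made up purely for illustration (they are not the sports data), and the helper names are hypothetical. It checks whether every expected frequency is above 5 and, if not, merges two columns and checks again.

```python
def expected_frequencies(observed):
    """Row Total x Column Total / Grand Total for every cell."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def all_expected_above_5(observed):
    """True only if every expected frequency is strictly greater than 5."""
    return all(e > 5 for row in expected_frequencies(observed) for e in row)


def merge_columns(observed, keep, absorb):
    """Return a new table where column `absorb` is added into column `keep`."""
    merged = []
    for row in observed:
        new_row = list(row)
        new_row[keep] += new_row[absorb]
        del new_row[absorb]
        merged.append(new_row)
    return merged


# Hypothetical survey data, invented for this example only.
small_table = [
    [12, 9, 2],
    [18, 16, 3],
]

print(all_expected_above_5(small_table))    # False: the last column is too sparse
merged_table = merge_columns(small_table, keep=1, absorb=2)
print(merged_table)                         # [[12, 11], [18, 19]]
print(all_expected_above_5(merged_table))   # True
```

The merged table is now \(2 \times 2\) rather than \(2 \times 3\), so everything that follows (including the degrees of freedom) should be based on the merged table.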


3. The Chi-Squared Statistic (\(\chi^2\))

Now we compare "Reality" (Observed) with "Theory" (Expected). We use the following formula to find our test statistic:

\(\chi^2_{calc} = \sum \frac{(O_i - E_i)^2}{E_i}\)

Step-by-Step Process:

  1. For each cell, subtract the Expected value from the Observed value (\(O - E\)).
  2. Square that number (this gets rid of any pesky minus signs!).
  3. Divide that squared result by the Expected value for that cell.
  4. Add up all the results from every cell in the table.

Key Takeaway: A large \(\chi^2\) value means that Reality is very different from Theory, which is evidence of an association. How large counts as "large" is decided by comparing it with a critical value.
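The four steps can be sketched in a few lines of Python. This example uses the sports table from Section 1, with the expected values typed in directly from the Row Total \(\times\) Column Total / Grand Total recipe. The function name `chi_squared` is an illustrative choice.

```python
def chi_squared(observed, expected):
    """Sum (O - E)^2 / E over every cell of the table."""
    total = 0.0
    for o_row, e_row in zip(observed, expected):
        for o, e in zip(o_row, e_row):
            difference = o - e              # step 1: O - E
            squared = difference ** 2       # step 2: square it
            total += squared / e            # steps 3 and 4: divide by E, add up
    return total


# Rows: Year 12, Year 13. Columns: Football, Tennis, Swimming.
observed = [
    [20, 10, 5],
    [15, 30, 20],
]
expected = [
    [12.25, 14.0, 8.75],
    [22.75, 26.0, 16.25],
]

chi_squared_calc = chi_squared(observed, expected)
print(round(chi_squared_calc, 3))   # 11.774
```

On its own, 11.774 does not tell us much. It only becomes meaningful once it is compared with a critical value.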


4. Degrees of Freedom (\(df\))

To look up our "critical value" in the statistical tables, we need to know the Degrees of Freedom. This tells us how much "wiggle room" the data has. For a table with \(r\) rows and \(c\) columns (the same idea as the \(n \times m\) table from Section 1, just with different letters):

\(df = (r - 1) \times (c - 1)\)

Example: In our \(2 \times 3\) sports table, the \(df = (2 - 1) \times (3 - 1) = 1 \times 2 = 2\).
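As a small sketch, here is the same calculation in Python, followed by a table lookup. The 5% significance level is an assumption made for this example (your question will tell you which level to use), and the dictionary simply copies a few 5% critical values as they appear in standard \(\chi^2\) tables.

```python
# 5% critical values as printed in standard chi-squared tables (3 d.p.).
# The 5% level is an assumption for this example.
CRITICAL_5_PERCENT = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}


def degrees_of_freedom(n_rows, n_cols):
    """(r - 1) x (c - 1) for an r x c contingency table."""
    return (n_rows - 1) * (n_cols - 1)


df = degrees_of_freedom(2, 3)            # the 2 x 3 sports table
critical_value = CRITICAL_5_PERCENT[df]

# Sports table value from the Section 3 formula, rounded to 3 d.p.
chi_squared_calc = 11.774

print(df)                                  # 2
print(critical_value)                      # 5.991
print(chi_squared_calc > critical_value)   # True, so reject H0 at the 5% level
```

So for the sports survey, at the 5% level, there is evidence of an association between year group and favorite sport.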


5. Yates’ Correction (For \(2 \times 2\) Tables Only)

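Sometimes, when we have a small table (specifically a \(2 \times 2\) table), the standard formula can be a bit too "generous": it tends to give a \(\chi^2\) value that is slightly too large, which makes it too easy to reject \(H_0\). To be more accurate, we use Yates' Continuity Correction.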

You only apply this if you have a \(2 \times 2\) contingency table. The formula changes slightly:

\(\chi^2_{Yates} = \sum \frac{(|O_i - E_i| - 0.5)^2}{E_i}\)

The \(|O_i - E_i|\) part just means "take the positive difference." Then you subtract 0.5 before squaring. This "shrinks" the difference slightly to make the test more conservative.

Common Mistake: Students often try to use Yates' correction on \(2 \times 3\) or \(3 \times 3\) tables. Don't do it! It is strictly for \(2 \times 2\) tables.
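To see how much difference the correction makes, here is a minimal Python sketch on a \(2 \times 2\) table. The counts are invented for illustration, and the function names are hypothetical.

```python
def chi_squared_plain(observed, expected):
    """Standard statistic: sum of (O - E)^2 / E."""
    return sum(
        (o - e) ** 2 / e
        for o_row, e_row in zip(observed, expected)
        for o, e in zip(o_row, e_row)
    )


def chi_squared_yates(observed, expected):
    """Yates' version: sum of (|O - E| - 0.5)^2 / E. For 2 x 2 tables only."""
    if len(observed) != 2 or len(observed[0]) != 2:
        raise ValueError("Yates' correction is only for 2 x 2 tables.")
    return sum(
        (abs(o - e) - 0.5) ** 2 / e
        for o_row, e_row in zip(observed, expected)
        for o, e in zip(o_row, e_row)
    )


# Hypothetical 2 x 2 table, invented for this example.
# Row totals 20 and 30, column totals 25 and 25, grand total 50.
observed = [
    [15, 5],
    [10, 20],
]
expected = [
    [10.0, 10.0],   # 20 x 25 / 50
    [15.0, 15.0],   # 30 x 25 / 50
]

print(round(chi_squared_plain(observed, expected), 3))   # 8.333
print(round(chi_squared_yates(observed, expected), 3))   # 6.75
```

Every cell here has \(|O_i - E_i| = 5\), which shrinks to 4.5 after the correction, so the statistic drops from about 8.333 to 6.75. That is the "more conservative" effect described above.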


6. Identifying Sources of Association

If your test concludes that there is an association, the examiner might ask: "Where is this association coming from?"

To answer this, look back at your calculations for \(\frac{(O_i - E_i)^2}{E_i}\). The cell with the largest contribution (the biggest number) is the one where the difference between reality and theory is the greatest.

Example: If the cell for "Year 13" and "Tennis" makes a massive contribution to the \(\chi^2\) total, you would say: "The main source of association is that Year 13 students played tennis much more (or less) than expected." To choose between "more" and "less", compare \(O_i\) with \(E_i\) for that cell.
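Here is a short Python sketch that lists the contribution of every cell in the sports table from Section 1 and picks out the largest one. The expected values come from the Row Total \(\times\) Column Total / Grand Total recipe, and the variable names are illustrative.

```python
row_labels = ["Year 12", "Year 13"]
col_labels = ["Football", "Tennis", "Swimming"]

observed = [
    [20, 10, 5],
    [15, 30, 20],
]
expected = [
    [12.25, 14.0, 8.75],
    [22.75, 26.0, 16.25],
]

# Contribution of each cell: (O - E)^2 / E, keyed by (row label, column label).
contributions = {}
for i, row_label in enumerate(row_labels):
    for j, col_label in enumerate(col_labels):
        o, e = observed[i][j], expected[i][j]
        contributions[(row_label, col_label)] = (o - e) ** 2 / e

for cell, value in contributions.items():
    print(cell, round(value, 3))

largest_cell = max(contributions, key=contributions.get)
print("Largest contribution:", largest_cell)   # ('Year 12', 'Football')
```

For this data the biggest contribution (about 4.903) comes from Year 12 and Football, where 20 students were observed against 12.25 expected. So you would say that Year 12 students chose football more than expected.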


Summary Checklist for Exam Success

  • Hypotheses: Always state \(H_0\) (Variables are independent) and \(H_1\) (Variables are associated).
  • Check \(E_i\): Are all expected values \( > 5\)? If not, merge rows/columns.
  • Yates?: If it's a \(2 \times 2\) table, use the correction formula.
  • Degrees of Freedom: Use \((r-1)(c-1)\).
  • Compare: Look up the critical value for your significance level and \(df\). If your \(\chi^2_{calc} > \text{Critical Value}\), you reject \(H_0\).
  • Context: Always write your final conclusion in the context of the original question (e.g., "There is evidence to suggest an association between age and sport choice").

Memory Aid: If the \(\chi^2\) is High, the Null must Go! (Reject \(H_0\)).
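If it helps to see the whole checklist in one place, here is a rough Python sketch that strings the steps together. The function name `chi_squared_test` is hypothetical, and the 5% critical values are an assumed significance level copied from standard \(\chi^2\) tables. Stating the hypotheses and writing the conclusion in context are still your job!

```python
def chi_squared_test(observed, critical_values):
    """Run the checklist on a table of observed counts.

    Returns (statistic, df, reject_h0). `critical_values` maps df to the
    critical value at whichever significance level you have chosen.
    """
    n_rows, n_cols = len(observed), len(observed[0])
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

    # Check E_i: every expected frequency must be greater than 5.
    if any(e <= 5 for row in expected for e in row):
        raise ValueError("An expected frequency is 5 or less: merge rows or columns first.")

    # Yates? Subtract 0.5 from |O - E| only for a 2 x 2 table.
    shift = 0.5 if (n_rows, n_cols) == (2, 2) else 0.0
    statistic = sum(
        (abs(o - e) - shift) ** 2 / e
        for o_row, e_row in zip(observed, expected)
        for o, e in zip(o_row, e_row)
    )

    # Degrees of freedom, then compare with the critical value.
    df = (n_rows - 1) * (n_cols - 1)
    reject_h0 = statistic > critical_values[df]
    return statistic, df, reject_h0


# 5% critical values from standard tables (the 5% level is an assumption).
CRITICAL_5_PERCENT = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}

sports = [
    [20, 10, 5],
    [15, 30, 20],
]
statistic, df, reject_h0 = chi_squared_test(sports, CRITICAL_5_PERCENT)
print(round(statistic, 3), df, reject_h0)   # 11.774 2 True
```

For the sports survey this gives \(\chi^2_{calc} \approx 11.774\) with \(df = 2\), which is above 5.991, so \(H_0\) is rejected at the 5% level.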