Welcome to Unit 8: Inference for Categorical Data (Chi-Square)!
In previous units, we spent a lot of time talking about means and proportions. But what happens when we want to look at a whole distribution of categories? For example, instead of just asking "What proportion of students like pizza?", what if we want to know if the distribution of favorite snacks (pizza, tacos, burgers, salads) matches what the cafeteria staff claims? That is where the Chi-Square (\(\chi^2\)) tests come in!
Don't worry if this seems a bit different from the Z-tests and T-tests you’ve done before. While the math looks a little new, the logic of "Hypothesis, Conditions, Calculations, and Conclusion" remains exactly the same. Let’s dive in!
1. What is the Chi-Square Statistic?
The Chi-Square statistic measures how much our "Observed" counts (what we actually saw in our sample) differ from our "Expected" counts (what we expected to see if the null hypothesis were true).
The formula looks like this:
\(\chi^2 = \sum \frac{(O - E)^2}{E}\)
Breaking it down:
• O = Observed count (the actual data)
• E = Expected count (the data we predicted)
• \(\sum\) = Sum (add them all up for every category!)
Think of it like this: If your observed counts are very close to your expected counts, \(\chi^2\) will be a small number. If they are very different, \(\chi^2\) will be a large number. A large Chi-Square value gives us evidence to reject the null hypothesis!
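To make the formula concrete, here is a minimal pure-Python sketch of the statistic. The counts below are made-up illustration values, not data from this unit:

```python
# A minimal sketch of the chi-square statistic: for every category,
# square the gap between observed and expected, divide by expected,
# and add it all up. The counts below are made-up illustration values.

def chi_square_stat(observed, expected):
    """Return the chi-square statistic for paired observed/expected counts."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [48, 35, 17]   # what we actually saw (hypothetical)
expected = [40, 40, 20]   # what H0 predicted (hypothetical)

print(round(chi_square_stat(observed, expected), 3))  # 2.675
```

Notice how the category with the biggest gap (17 vs. 20 is small, but 48 vs. 40 is large relative to 40) contributes the most to the total.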
Did you know? The Chi-Square distribution is always skewed to the right and starts at zero. This is because we square the differences, so the value can never be negative!
2. The Chi-Square Goodness-of-Fit Test
We use this test when we have one sample and one categorical variable. We want to see if the sample "fits" a specific population distribution.
Example: A company claims that their bag of fruit snacks contains 20% strawberry, 30% grape, and 50% orange. You buy a bag and count the flavors to see if the company is telling the truth. This is a Goodness-of-Fit test!
Hypotheses for Goodness-of-Fit:
\(H_0\): The specified distribution of [variable] is correct.
\(H_a\): The specified distribution of [variable] is NOT correct.
Conditions (The "Big Three"):
1. Random: The data must come from a random sample or randomized experiment.
2. 10% Rule: When sampling without replacement, the sample size \(n\) must be less than 10% of the population size.
3. Large Counts: All expected counts must be at least 5. (Wait! This is important: it's the expected counts, not the observed counts, that must be \(\ge 5\).)
Degrees of Freedom (df):
For Goodness-of-Fit, \(df = \text{number of categories} - 1\).
Key Takeaway: Goodness-of-Fit checks if one list of data matches a claimed distribution. Always check that your Expected counts are at least 5 before proceeding!
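Here is how the fruit-snack example might play out in code. The claimed distribution (20% / 30% / 50%) comes from the example above; the observed counts for a hypothetical bag of 60 snacks are invented for illustration. Handily, when \(df = 2\) the chi-square tail probability is exactly \(e^{-\chi^2/2}\), so no statistics library is needed:

```python
import math

# Goodness-of-fit sketch for the fruit-snack example. The claimed
# distribution (20% / 30% / 50%) is from the text; the observed counts
# for a hypothetical bag of 60 snacks are invented for illustration.

claimed = [0.20, 0.30, 0.50]        # strawberry, grape, orange
observed = [9, 21, 30]              # hypothetical bag
n = sum(observed)                   # 60 snacks

expected = [p * n for p in claimed]           # [12.0, 18.0, 30.0]
assert all(e >= 5 for e in expected)          # Large Counts condition

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(claimed) - 1                         # categories - 1 = 2

# Special case: when df = 2, the chi-square tail probability is exactly
# exp(-chi_sq / 2), which is why no statistics library is needed here.
p_value = math.exp(-chi_sq / 2)
print(f"chi-square = {chi_sq:.3f}, df = {df}, P = {p_value:.3f}")
```

With \(\chi^2 = 1.25\) and \(P \approx 0.54\), this hypothetical bag gives no reason to doubt the company's claim.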
3. Inference for Two-Way Tables
Sometimes our data isn't just a single list; it’s a table (rows and columns). For these, we have two types of tests: Homogeneity and Independence. They use the same math, but the way we collect data is different.
Test for Homogeneity
When to use: You have two or more independent groups (samples) and you are measuring one categorical variable for each. You want to see if the groups are "the same" (homogeneous).
Example: Do freshmen, sophomores, juniors, and seniors have the same distribution of music preferences (Pop, Rock, Hip Hop)? (Four groups, one variable).
Test for Independence
When to use: You have one single sample and you are measuring two categorical variables for each individual. You want to see if there is an association between the two variables.
Example: Take a sample of 100 adults. Record their gender and whether or not they are colorblind. (One group, two variables).
Memory Aid:
• Homogeneity = Hundreds of samples (okay, maybe just 2 or more).
• Independence = Individual sample (just 1).
4. Calculating Expected Counts and Degrees of Freedom for Tables
When working with tables, we need a special way to find the expected counts if the null hypothesis were true.
Expected Count Formula:
\(E = \frac{\text{Row Total} \times \text{Column Total}}{\text{Table Total}}\)
Degrees of Freedom for Tables:
\(df = (\text{number of rows} - 1) \times (\text{number of columns} - 1)\)
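The two formulas above translate directly into code. The 2-row, 3-column table of observed counts below is hypothetical:

```python
# Expected counts and degrees of freedom for a two-way table, using
# E = (row total * column total) / table total. The 2x3 table of
# observed counts below is hypothetical.

table = [
    [30, 20, 10],   # group 1 (hypothetical counts)
    [20, 15, 25],   # group 2 (hypothetical counts)
]

row_totals = [sum(row) for row in table]         # [60, 60]
col_totals = [sum(col) for col in zip(*table)]   # [50, 35, 35]
total = sum(row_totals)                          # 120

expected = [[r * c / total for c in col_totals] for r in row_totals]
df = (len(table) - 1) * (len(table[0]) - 1)      # (2-1)(3-1) = 2

for row in expected:
    print([round(e, 2) for e in row])            # [25.0, 17.5, 17.5] twice
print("df =", df)                                # df = 2
```

Note that the expected counts (17.5) are not whole numbers, and that is perfectly fine: they are long-run averages, not actual snack counts.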
Step-by-Step Process for Chi-Square:
1. State: Define your \(H_0\) and \(H_a\) and your alpha level (\(\alpha\)).
2. Plan: Identify the test (Goodness-of-Fit, Homogeneity, or Independence) and check conditions (Random, 10%, Large Counts).
3. Do: Calculate the \(\chi^2\) statistic, the \(df\), and the P-value.
4. Conclude: Compare the P-value to \(\alpha\). If \(P < \alpha\), reject \(H_0\) and say you have convincing evidence for \(H_a\); if \(P \ge \alpha\), fail to reject \(H_0\) (you never "accept" it!).
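Putting the four steps together, here is an end-to-end sketch of a homogeneity test asking whether two classes (say, freshmen and seniors) share the same distribution of music preference. All counts are hypothetical, and the P-value again uses the closed form \(e^{-\chi^2/2}\) that holds when \(df = 2\):

```python
import math

# End-to-end sketch of a chi-square test for homogeneity: do two
# classes share the same distribution of music preference?
# All counts here are hypothetical.

alpha = 0.05
table = [
    [25, 15, 20],   # freshmen: Pop, Rock, Hip Hop (hypothetical)
    [15, 25, 20],   # seniors:  Pop, Rock, Hip Hop (hypothetical)
]

# Do: expected counts, statistic, and degrees of freedom
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
total = sum(row_totals)
expected = [[r * c / total for c in col_totals] for r in row_totals]
assert all(e >= 5 for row in expected for e in row)   # Large Counts

chi_sq = sum((o - e) ** 2 / e
             for o_row, e_row in zip(table, expected)
             for o, e in zip(o_row, e_row))
df = (len(table) - 1) * (len(table[0]) - 1)           # (2-1)(3-1) = 2

# Conclude: for df = 2 the tail probability is exactly exp(-chi_sq / 2)
p_value = math.exp(-chi_sq / 2)
verdict = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"chi-square = {chi_sq:.3f}, df = {df}, P = {p_value:.4f} -> {verdict}")
```

Here \(\chi^2 = 5.0\) and \(P \approx 0.082 > 0.05\), so in this made-up example we would fail to reject \(H_0\): not enough evidence that the two classes differ.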
5. Common Mistakes to Avoid
• Mixing up the tests: Always ask yourself "How many samples were taken?" to decide between Homogeneity and Independence.
• Using percentages instead of counts: Chi-Square tests must be performed on counts, not percentages. If you are given percentages, convert them to counts first by multiplying each proportion by the sample size! (Observed counts are whole numbers; expected counts are allowed to be decimals.)
• Checking the wrong Large Counts: Remember, the "Large Counts" condition applies to Expected counts, not the ones you observed in the data.
• Forgetting df: If you get the degrees of freedom wrong, your P-value will be wrong. For tables, it's \((r-1)(c-1)\).
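For the percentages pitfall specifically, the fix is a one-line conversion. The percentages below are the fruit-snack claim from earlier; the sample size of 80 is hypothetical:

```python
# Fixing the "percentages instead of counts" mistake: convert claimed
# percentages into expected counts before running the test. The sample
# size n = 80 is hypothetical; the percentages are the fruit-snack claim.

percentages = [20, 30, 50]          # strawberry, grape, orange (%)
n = 80                              # hypothetical sample size

expected_counts = [p / 100 * n for p in percentages]
print(expected_counts)
```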
Quick Review Box
Chi-Square Goodness-of-Fit
• 1 Sample, 1 Variable
• \(df = \text{categories} - 1\)
Chi-Square Homogeneity
• 2+ Samples, 1 Variable
• \(df = (r-1)(c-1)\)
Chi-Square Independence
• 1 Sample, 2 Variables
• \(df = (r-1)(c-1)\)
All Chi-Square tests require Expected Counts \(\ge 5\)!
Final Encouragement: Chi-Square is often one of the favorite units for students because the table math is very logical and consistent. Just keep an eye on your "Expected" vs "Observed" and you'll do great!