Welcome to the Quality of Tests!
In your previous statistics studies, you’ve learned how to carry out hypothesis tests to see if something has changed—like whether a new coin is biased or if a medication is effective. But how do we know if our test is actually any good? Does it make mistakes often? Does it have the "strength" to spot a change when one really happens?
In this chapter, we explore the Quality of Tests. We will learn how to measure the "success rate" of a test and understand the two main ways a statistical test can go wrong. Don’t worry if it sounds a bit abstract at first; we’ll use plenty of real-world analogies to make it clear!
1. Type I and Type II Errors
Even the best statistical test can make a mistake. Because we are using samples to make guesses about populations, there is always a chance that our sample is just plain weird and leads us to the wrong conclusion.
What is a Type I Error?
A Type I Error happens when you reject the null hypothesis (\(H_0\)) even though it is actually true. In other words, you think there’s a change or an effect, but there isn’t.
Real-world analogy: A "False Alarm." Imagine a smoke detector going off because you burnt some toast. There isn't a real fire (\(H_0\) is true), but the alarm says there is (\(H_0\) is rejected).
The probability of making a Type I error is denoted by the Greek letter alpha (\(\alpha\)). For a test with a fixed critical region, the probability of a Type I error is simply the actual significance level of the test.
What is a Type II Error?
A Type II Error happens when you fail to reject the null hypothesis (\(H_0\)) even though it is actually false. You miss the change that was actually there.
Real-world analogy: A "Missed Signal." Imagine a fire starting in the kitchen, but the smoke detector stays silent. There is a real fire (\(H_0\) is false), but the alarm fails to go off (\(H_0\) is not rejected).
The probability of making a Type II error is denoted by the Greek letter beta (\(\beta\)).
Quick Review Table:
| | Your Decision: Reject \(H_0\) | Your Decision: Fail to Reject \(H_0\) |
|---|---|---|
| The Reality: \(H_0\) is True | Type I Error (\(\alpha\)) | Correct decision |
| The Reality: \(H_0\) is False | Correct decision | Type II Error (\(\beta\)) |
Memory Aid:
Type I: Incorrectly Identified a change (False Alarm).
Type II: II (Too) blind to see the change (Missed Signal).
Key Takeaway: A Type I error is a "false positive," and a Type II error is a "false negative." We want the probabilities of both to be as small as possible!
2. Size and Power of a Test
Now that we know about errors, we can use them to define how "good" a test is using two key terms: Size and Power.
The Size of a Test
The Size of a test is just another name for the probability of a Type I error (\(\alpha\)).
\( \text{Size} = P(\text{Type I Error}) = P(\text{Reject } H_0 | H_0 \text{ is true}) \)
The Power of a Test
The Power of a test is its ability to correctly spot a change. It is the probability of rejecting the null hypothesis when it is actually false (which is exactly what we want a test to do!).
Mathematically, Power is linked to the Type II error (\(\beta\)):
\( \text{Power} = 1 - P(\text{Type II Error}) \)
\( \text{Power} = 1 - \beta \)
High power is good! It means the test is "powerful" enough to detect that something has changed.
Did you know?
You can increase the power of a test by increasing the sample size (\(n\)). A bigger sample gives you more evidence, making you less likely to miss a real effect!
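We can see this effect numerically with a short Python sketch (purely illustrative, not part of the syllabus; the function names are my own). It tests \(H_0: p = 0.5\) against \(H_1: p > 0.5\), finds the largest one-tailed critical region \(X \geq c\) whose size stays at or below 5%, and then computes the power at a true value of \(p = 0.6\) for increasing sample sizes:

```python
from math import comb

def binom_sf(c, n, p):
    """P(X >= c) for X ~ B(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c, n + 1))

def power_at(n, p_true, alpha=0.05, p0=0.5):
    """Power of the test H0: p = p0 vs H1: p > p0 at the true value p_true."""
    # Smallest c with P(X >= c | p0) <= alpha gives the critical region X >= c
    c = next(c for c in range(n + 1) if binom_sf(c, n, p0) <= alpha)
    return binom_sf(c, n, p_true)

for n in (20, 50, 100):
    print(f"n = {n:3d}: power at p = 0.6 is {power_at(n, 0.6):.4f}")
```

Running this shows the power climbing as \(n\) grows: with \(n = 20\) the test detects \(p = 0.6\) only about 13% of the time, while with \(n = 100\) it succeeds well over half the time.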
Key Takeaway: Size = Probability of a false alarm. Power = Probability of a successful detection. We want a small Size and a large Power.
3. The Power Function
The probability of a Type II error (and therefore the Power) depends on the true value of the parameter under the alternative hypothesis.
For example, if you are testing if a coin is biased (\(H_0: p = 0.5\)), the test will find it much easier to reject \(H_0\) if the true probability is \(p=0.9\) than if it is \(p=0.51\).
The Power Function is a function (often plotted as a graph) that shows the Power of the test for all possible true values of the parameter being tested.
What does the graph look like?
- For a one-tailed test, the power function usually starts near the significance level and increases toward 1 as the parameter moves further away from the null value.
- If the true value is exactly equal to the null hypothesis value, the Power is simply the Size of the test.
Key Takeaway: The power function helps us visualize how effective our test is at different "alternative" realities. The steeper the curve, the more sensitive the test is.
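To make the coin example concrete, here is a short Python sketch of a power function (illustrative only; the sample size \(n = 20\) and critical region \(X \geq 15\) are my own choices, not from the text). It computes the power of the test \(H_0: p = 0.5\) at several possible true values of \(p\):

```python
from math import comb

def power_fn(p, n=20, c=15):
    """Power function: P(X >= c) when X ~ B(n, p), for critical region X >= c."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c, n + 1))

for p in (0.5, 0.6, 0.7, 0.8, 0.9):
    print(f"p = {p}: power = {power_fn(p):.4f}")
```

Notice that at \(p = 0.5\) (the null value) the power equals the size of the test, about 0.021, and it rises toward 1 as the true \(p\) moves further from 0.5, exactly as the graph description above says.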
4. Step-by-Step: Calculating Errors and Power
Questions in Further Statistics 1 often ask you to calculate these probabilities using distributions like the Binomial, Poisson, or Geometric. Here is how to approach them:
Step 1: Define the Critical Region
Before you can calculate errors, you must know exactly what values of your test statistic lead to rejecting \(H_0\). (e.g., "Reject \(H_0\) if \(X \geq 8\)").
Step 2: Calculate Type I Error (Size)
Use the parameter value from \(H_0\).
\( P(\text{Type I Error}) = P(\text{X is in the Critical Region given } H_0 \text{ is true}) \)
Step 3: Calculate Type II Error (\(\beta\))
You will be given a specific "alternative" value for the parameter (let's call it \(\lambda_1\) or \(p_1\)).
\( P(\text{Type II Error}) = P(\text{X is NOT in the Critical Region given the parameter is } \lambda_1) \)
Step 4: Calculate Power
Simply do \( 1 - P(\text{Type II Error}) \).
Example:
Suppose \(H_0: \lambda = 3\) and \(H_1: \lambda > 3\). Your critical region is \(X \geq 7\).
To find the Size, calculate \(P(X \geq 7)\) using \(\text{Po}(3)\).
To find the Power if the true \(\lambda\) is 5, calculate \(P(X \geq 7)\) using \(\text{Po}(5)\).
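The worked example above can be checked with a few lines of Python (illustrative only; in the exam you would use Poisson tables or a calculator). The trick is that \(P(X \geq 7) = 1 - P(X \leq 6)\), with the CDF evaluated under whichever \(\lambda\) the step calls for:

```python
from math import exp, factorial

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Po(lam)."""
    return sum(exp(-lam) * lam**i / factorial(i) for i in range(k + 1))

# Critical region: reject H0 if X >= 7

# Step 2 - Size: use H0's parameter, lambda = 3
size = 1 - poisson_cdf(6, 3)

# Steps 3 & 4 - Power at the alternative lambda = 5, and Type II error
power = 1 - poisson_cdf(6, 5)
beta = 1 - power

print(f"Size  = {size:.4f}")   # ≈ 0.0335
print(f"Beta  = {beta:.4f}")   # ≈ 0.7622
print(f"Power = {power:.4f}")  # ≈ 0.2378
```

Note how the same probability, \(P(X \geq 7)\), is computed twice: once in the "Null world" (\(\lambda = 3\)) for the size, and once in the "Alternative world" (\(\lambda = 5\)) for the power.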
Common Mistake to Avoid:
When calculating the Type II error, students often accidentally calculate the probability of being in the critical region again. Remember: Type II error is failing to reject, so you want the probability of being outside the critical region!
Key Takeaway: Always check which parameter value you are using for which calculation. \(H_0\)'s value is for Type I; the "new" value is for Type II/Power.
Summary Checklist
- Can you define Type I and Type II errors in words?
- Do you know that Size is the probability of a Type I error?
- Do you know that Power is \(1 - P(\text{Type II Error})\)?
- Can you find the critical region for Binomial or Poisson tests and use it to find error probabilities?
- Can you explain how increasing sample size affects the power of a test?
Don't worry if this seems tricky at first! The more you practice shifting between the "Null world" (for Type I) and the "Alternative world" (for Type II), the more natural it will feel.