Introduction: Why do we Sample?

Ever wondered how pollsters predict election results by talking to only a few thousand people out of millions? Or how a chef knows if a giant pot of soup is salty enough by tasting just one spoonful? That, in a nutshell, is Sampling!

In H2 Mathematics, we study sampling to understand how we can make very accurate "guesses" about a huge group (the Population) by looking at a smaller group (the Sample). It saves time, money, and sometimes is the only way to get data without destroying everything (like testing how much pressure a glass bottle can take before it breaks!).

1. The Basics: Population vs. Sample

Before we dive into the math, let’s get our definitions straight:

  • Population: The entire collection of items or people you are interested in. (Example: All H2 Math students in Singapore).
  • Sample: A subset of the population that we actually measure. (Example: 50 students chosen from your school).
  • Simple Random Sample (SRS): A sample where every single member of the population has an equal chance of being selected. It’s like picking names out of a well-shaken hat.

Did you know? If a sample isn't random, it might be biased. For example, if you only ask people at a gym about their health, your results won't represent the whole country!
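The "names out of a hat" idea translates directly into code. Here is a minimal sketch, using a made-up population of 1000 students, of how `random.sample` takes a simple random sample: every member has an equal chance of selection and nobody is picked twice.

```python
# Simple random sample (SRS) from a hypothetical population of 1000 students.
import random

random.seed(3)  # fixed seed so the example is reproducible

population = [f"student_{i}" for i in range(1, 1001)]   # 1000 students
sample = random.sample(population, 50)                  # SRS of size 50

print(len(sample))        # 50
print(len(set(sample)))   # 50 -- sampling is without replacement
```

Because `random.sample` draws without replacement, the result matches the textbook definition of an SRS of size 50 from this population.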


2. The Sample Mean \(\bar{X}\) as a Random Variable

This is where it gets interesting. If you take a sample of 10 people and calculate their average height, you get a number. If you take another sample of 10 different people, you’ll likely get a different average height.

Because the value of the sample mean changes depending on which sample you pick, we treat the sample mean as a Random Variable, denoted by \(\bar{X}\).

Key Properties of \(\bar{X}\):

If a population has a mean \(\mu\) and a variance \(\sigma^2\), then for a sample of size \(n\):

  1. The Expectation (The Center): \(E(\bar{X}) = \mu\).
    Translation: The sample mean does not systematically over- or under-estimate \(\mu\); on average, it lands exactly on the population mean.
  2. The Variance (The Spread): \(Var(\bar{X}) = \frac{\sigma^2}{n}\).
    Translation: As your sample size \(n\) gets bigger, your sample mean becomes more consistent (less spread out).

Memory Aid: Think of \(n\) as "Information." The more "Information" (\(n\)) you have, the less "Error" or "Spread" there is in your average!
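Both properties are easy to check by simulation. The sketch below (illustrative only, using a Uniform(0, 1) population with \(\mu = 0.5\) and \(\sigma^2 = \frac{1}{12}\)) draws many samples of size \(n = 25\) and shows that the sample means centre on \(\mu\) with variance close to \(\frac{\sigma^2}{n}\).

```python
# Simulation check of E(X_bar) = mu and Var(X_bar) = sigma^2 / n,
# using a Uniform(0, 1) population (mu = 0.5, sigma^2 = 1/12).
import random
import statistics

random.seed(0)

mu, sigma2 = 0.5, 1 / 12
n = 25            # sample size
trials = 20_000   # number of repeated samples

# Draw many samples of size n and record each sample mean.
means = [statistics.fmean(random.random() for _ in range(n))
         for _ in range(trials)]

print(round(statistics.fmean(means), 3))      # close to mu = 0.5
print(round(statistics.pvariance(means), 4))  # close to sigma2 / n ~ 0.0033
```

Rerunning with a larger \(n\) shrinks the variance of the means, exactly as \(\frac{\sigma^2}{n}\) predicts.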


3. Distribution of the Sample Mean

How does \(\bar{X}\) actually look on a graph? There are two main scenarios you need to know for your exams:

Scenario A: The Population is already Normal

If the population follows a Normal Distribution \(N(\mu, \sigma^2)\), then \(\bar{X}\) is always Normal, no matter how small the sample size is.

Formula: \(\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)\)
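A typical exam-style calculation follows directly from this formula. In the hypothetical example below, heights are \(N(170, 5^2)\), so for a sample of \(n = 9\) we have \(\bar{X} \sim N\left(170, \frac{25}{9}\right)\) exactly, and we can compute \(P(\bar{X} > 172)\).

```python
# Scenario A example (hypothetical data): heights ~ N(170, 5^2), n = 9.
# X_bar is exactly Normal with mean 170 and standard deviation 5/sqrt(9).
import math
from statistics import NormalDist

mu, sigma, n = 170, 5, 9
xbar_dist = NormalDist(mu, sigma / math.sqrt(n))  # sd of X_bar, not sigma!

p = 1 - xbar_dist.cdf(172)   # P(X_bar > 172)
print(round(p, 4))           # 0.1151
```

Note that the distribution is built with \(\frac{\sigma}{\sqrt{n}}\), the standard deviation of \(\bar{X}\); using \(\sigma\) itself is the classic error warned about in the "Common Mistakes" list below.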

Scenario B: The Population is NOT Normal (The Central Limit Theorem)

This is the "Magic" of Statistics! Even if your population is weirdly shaped (skewed, flat, or bumpy), the distribution of \(\bar{X}\) will become approximately Normal if your sample size is sufficiently large.

The Rule of Thumb: In the H2 syllabus, "sufficiently large" usually means \(n \ge 30\).

The Central Limit Theorem (CLT) states:
If \(n\) is large (\(n \ge 30\)), then \(\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)\) approximately.

Don't worry if this seems tricky! Just remember: Large \(n\) \(\rightarrow\) Normal distribution for the average. It’s like a crowd of chaotic people; individually they are unpredictable, but as a large group, their average behavior follows a predictable pattern.
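You can watch the CLT happen in a short simulation. The sketch below (values are illustrative) draws samples of size \(n = 30\) from a heavily skewed Exponential population with mean 1 and variance 1; the resulting sample means cluster symmetrically around \(\mu = 1\) with standard deviation close to \(\frac{1}{\sqrt{30}}\), just as the theorem predicts.

```python
# CLT sketch: means of samples from a skewed Exponential population
# (mean 1, variance 1) become approximately Normal once n is large.
import random
import statistics

random.seed(1)

n, trials = 30, 10_000
means = [statistics.fmean(random.expovariate(1.0) for _ in range(n))
         for _ in range(trials)]

# By the CLT, X_bar ~ N(1, 1/30) approximately.
print(round(statistics.fmean(means), 2))  # close to mu = 1
print(round(statistics.stdev(means), 2))  # close to 1/sqrt(30) ~ 0.18
```

Plotting a histogram of `means` would show the familiar bell shape, even though a histogram of the raw Exponential values is sharply skewed.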


4. Unbiased Estimates: Working with Real Data

In the real world, we often don't know the true population mean (\(\mu\)) or variance (\(\sigma^2\)). We have to use our sample data to estimate them.

1. Estimating the Population Mean (\(\mu\))

The best estimate is simply the sample mean \(\bar{x}\):

\(\text{Unbiased estimate of } \mu = \bar{x} = \frac{\sum x}{n}\)

2. Estimating the Population Variance (\(\sigma^2\))

This is a common trap! You might think you just use the sample variance formula, but to get an unbiased estimate (labeled \(s^2\)), we divide by \(n-1\) instead of \(n\).

The Formula:
\(s^2 = \frac{1}{n-1} \left[ \sum x^2 - \frac{(\sum x)^2}{n} \right]\)

Why \(n-1\)? Dividing by \(n\) underestimates the true spread on average, because sample values tend to sit closer to their own mean \(\bar{x}\) than to the population mean \(\mu\). Dividing by the slightly smaller \(n-1\) corrects this "bias." We call this the unbiased estimate of the population variance.


5. Dealing with Summarized Data

Sometimes, exam questions won't give you a list of numbers. They will give you "summarized data" like \(\sum (x-a)\) or \(\sum (x-a)^2\). Don't panic! We just use adjusted versions of our formulas.

Step-by-Step for Summarized Data:

If you are given \(\sum (x-a)\) and \(\sum (x-a)^2\):

  1. Find the mean of the "shifted" data: \(\overline{x-a} = \frac{\sum (x-a)}{n}\)
  2. Find the unbiased estimate of the population mean: \(\bar{x} = a + \overline{x-a}\)
  3. Find the population variance estimate \(s^2\):
    \(s^2 = \frac{1}{n-1} \left[ \sum (x-a)^2 - \frac{(\sum (x-a))^2}{n} \right]\)

Quick Tip: Notice that the value of \(a\) doesn't change the variance! Variance measures spread, and shifting the whole group of numbers left or right (by \(a\)) doesn't change how spread out they are.
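The three steps above can be run as a short calculation. The numbers here are made up for illustration: suppose a question gives \(n = 50\), \(a = 20\), \(\sum (x-a) = 15\) and \(\sum (x-a)^2 = 130\).

```python
# Summarized-data steps with hypothetical values:
# n = 50, a = 20, sum(x - a) = 15, sum((x - a)^2) = 130.
n = 50
a = 20
s1 = 15    # sum of (x - a)
s2 = 130   # sum of (x - a)^2

shifted_mean = s1 / n                   # step 1: mean of the shifted data
xbar = a + shifted_mean                 # step 2: estimate of mu
var_est = (s2 - s1**2 / n) / (n - 1)    # step 3: unbiased estimate of sigma^2

print(xbar)                # 20.3
print(round(var_est, 4))   # 2.5612
```

As the Quick Tip says, \(a\) only shifts the mean back into place in step 2; it cancels out entirely in the variance formula.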


Common Mistakes to Avoid

  • Confusing \(\sigma^2\) and \(\frac{\sigma^2}{n}\): Use \(\sigma^2\) for one individual. Use \(\frac{\sigma^2}{n}\) when you are talking about the mean of a group.
  • Forgetting the "n-1": Always check if the question asks for the "sample variance" (divide by \(n\)) or the "unbiased estimate of population variance" (divide by \(n-1\)). In Sampling, we almost always want the \(n-1\) version.
  • CLT Conditions: Only invoke the Central Limit Theorem if the original population is not normal and \(n \ge 30\). If the population is already normal, you don't need CLT!

Key Takeaways

1. The sample mean \(\bar{X}\) is a random variable with mean \(\mu\) and variance \(\frac{\sigma^2}{n}\).
2. Central Limit Theorem: If \(n \ge 30\), \(\bar{X}\) is approximately Normal regardless of the population shape.
3. Unbiased Estimates: Use \(\bar{x}\) to estimate \(\mu\), and use the formula with \(\frac{1}{n-1}\) to estimate \(\sigma^2\).
4. Larger sample sizes (\(n\)) lead to more reliable estimates (smaller variance).