Word tables derived from corpus data - English Language (9093) - Cambridge International AS Level

Welcome to the World of Corpus Data!

Hello! Today we are going to dive into a fascinating part of English Language study: Word Tables derived from Corpus Data. Don't let the technical name scare you off! Essentially, we are going to learn how to be "language detectives." Instead of just reading one story, we use computers to look at millions of words at once to find patterns. It’s like looking at a forest from a helicopter instead of just looking at one tree.

In your Cambridge 9093 exam, you might be given a table of words and asked to analyze what they tell us about a text or a specific way people speak. Let's break down how to master this!

What is a Corpus?

Before we look at tables, we need to know where the numbers come from. A Corpus (plural: corpora) is a large, searchable digital collection of real-world language. It can include everything from books and news articles to transcripts of people chatting at a coffee shop.

Think of it like this: If a single book is a "photo" of language, a corpus is a "satellite map" of the entire language landscape.

Understanding Word Frequency Tables

The most common way you will see corpus data is in a Word Frequency Table. This is simply a list of words and how many times they appear in a text or collection of texts.

1. Raw Frequency

This is the actual number of times a word appears. For example, in a news article, the word "government" might appear 25 times.
Quick Tip: The most frequent words in almost any text are "function words" like the, and, to, of. These are usually less interesting to analyze than "content words" like freedom, crisis, or innovative.
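
To see where a raw frequency table comes from, here is a minimal Python sketch. The sample text, the function names, and the tiny function-word list (just the four words from the Quick Tip) are all made up for illustration; real corpus tools use far larger texts and longer word lists.

```python
import re
from collections import Counter

# Hypothetical sample text standing in for a real news article.
sample_text = (
    "The government said the crisis was over. "
    "The government plans new rules, and the public wants the government to act."
)

# Assumption: a tiny function-word list taken from the Quick Tip above.
FUNCTION_WORDS = {"the", "and", "to", "of"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def frequency_table(text, top_n=5):
    """Return the top_n (word, raw count) pairs, most frequent first."""
    return Counter(tokenize(text)).most_common(top_n)

def content_frequency_table(text, top_n=5):
    """Same as frequency_table, but with the function words filtered out."""
    counts = Counter(t for t in tokenize(text) if t not in FUNCTION_WORDS)
    return counts.most_common(top_n)

table = frequency_table(sample_text)
for word, count in table:
    print(f"{word:<12}{count}")
```

In this sample, "the" tops the full table, while "government" tops the content-word table, which is the word that actually tells you about the subject matter.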

2. Relative Frequency (per million words)

Sometimes, data is shown as "frequency per million words." To get it, divide the raw count by the total number of words in the text or corpus, then multiply by one million. This allows us to compare a small text to a huge one fairly. It’s like looking at a batting average in sports: it tells you how often something happens regardless of how long the game lasts.
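
The calculation can be sketched in a few lines of Python. The word counts and text sizes below are hypothetical figures chosen to make the arithmetic easy to follow, and the function name is just an illustration.

```python
def per_million(raw_count, total_words):
    """Convert a raw count into a frequency per million words."""
    return raw_count * 1_000_000 / total_words

# Hypothetical figures: "government" appears 25 times in a 1,000-word article
# and 3,000 times in a 2,000,000-word corpus.
article_rate = per_million(25, 1_000)
corpus_rate = per_million(3_000, 2_000_000)

print(f"Article: {article_rate:.0f} per million words")
print(f"Corpus:  {corpus_rate:.0f} per million words")
```

Although the corpus has the bigger raw count, the article uses the word far more often relative to its length (25,000 per million versus 1,500 per million), which is exactly the kind of fair comparison relative frequency allows.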

Key Takeaway:

Always look for the Content Words (nouns, main verbs, adjectives, adverbs) that appear most often. They usually reveal the Subject Matter and the Tone of the text.

The "Company They Keep": Collocation

Linguist J.R. Firth famously said, "You shall know a word by the company it keeps." In corpus data, this is called Collocation.

Collocations are words that appear together more often than you would expect by chance. Example: We say "heavy rain" but we don't usually say "weighty rain." Even though the meanings are similar, "weighty" and "rain" don't collocate.

Why does this matter for your exam?
If you see a table showing that the word "immigrants" often collocates with the word "flood" or "stream," it tells you that the writer is using a "water metaphor." This suggests the writer may be presenting the topic as something that needs to be controlled or feared. Data gives us evidence for claims about Bias and Perspective!
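
A collocation table can be approximated by counting the words that fall within a few words of a chosen "node" word. The Python sketch below uses the "heavy rain" example from above; the sample text, the function name, and the window size are assumptions made for illustration, and for simplicity the window ignores sentence boundaries.

```python
import re
from collections import Counter

def collocates(text, node, window=2):
    """Count the words found within `window` words either side of `node`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            start = max(0, i - window)
            # Words before the node plus words after it (the node itself is skipped).
            neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts

# Hypothetical sample text.
sample = (
    "Heavy rain fell all night. The heavy rain caused floods. "
    "Light rain is expected, then heavy rain again."
)

rain_partners = collocates(sample, "rain", window=1)
print(rain_partners.most_common(3))
```

Here "heavy" turns up next to "rain" three times and "light" only once, so "heavy" is the stronger collocate in this sample.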

N-Grams and Clusters

Sometimes a table won't just show single words, but groups of words. These are called N-Grams (or lexical clusters).
- A 2-gram (also called a bigram) is two words (e.g., "social media")
- A 3-gram (also called a trigram) is three words (e.g., "as a result")

These clusters often act like Discourse Markers. They help organize the text and show how formal or informal it is. For example, "I don't know" is a common cluster in spoken language, while "on the other hand" is common in formal essays.
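
Counting N-Grams works by sliding a window of N words along the text. The Python sketch below is one simple way to do it; the sample text and function names are invented for illustration.

```python
import re
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined into one string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def top_clusters(text, n, top_n=3):
    """Return the top_n most frequent n-grams with their counts."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(ngrams(tokens, n)).most_common(top_n)

# Hypothetical snippet of spoken language.
sample = "I don't know why. I don't know how. As a result, I don't know what to do."

print(top_clusters(sample, 3, top_n=1))
```

In this sample the most frequent 3-gram is "i don't know", which fits the point that this cluster is typical of spoken language.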

Step-by-Step: How to Analyze a Word Table

Don't worry if a table looks like a lot of numbers at first! Just follow these steps:

Step 1: Identify the "Outliers"
Look for words that appear much more frequently than you’d expect for the topic. If you’re looking at an advertisement for a car and the word "family" appears more than "engine," the purpose is probably emotional connection rather than technical specs.

Step 2: Compare and Contrast
If the exam gives you two tables (e.g., men's speech vs. women's speech, or 19th-century news vs. 21st-century news), look for the differences. Which words are missing from one but present in the other? Which words appear in both, but much more often in one?

Step 3: Link to Audience and Purpose
Always bring it back to the syllabus! Why is this word being used? Is it to persuade a specific Audience? Is it to fit the Genre conventions of a blog or a report?

Step 4: Look for Patterns in Word Classes
Are the top words mostly Adjectives (descriptive/emotive) or Verbs (action-oriented)? This tells you a lot about the text's Style.
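
The comparison in Step 2 can be sketched in Python using the car advertisement idea from Step 1. Both frequency tables below are hypothetical, as are the function names; the point is only to show how "missing" and "shared" words can be picked out.

```python
from collections import Counter

# Hypothetical frequency tables for two car advertisements.
family_ad = Counter({"family": 12, "safe": 7, "engine": 2})
performance_ad = Counter({"engine": 15, "torque": 9, "safe": 3})

def only_in(first, second):
    """Words that appear in the first table but not in the second."""
    return sorted(set(first) - set(second))

def shared_words(first, second):
    """Words in both tables, mapped to (count in first, count in second)."""
    return {w: (first[w], second[w]) for w in sorted(set(first) & set(second))}

print("Only in family ad:     ", only_in(family_ad, performance_ad))
print("Only in performance ad:", only_in(performance_ad, family_ad))
print("In both:               ", shared_words(family_ad, performance_ad))
```

The code only finds the differences. Explaining them in terms of Audience and Purpose (Step 3) is still your job.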

Memory Aid: The Three C’s

When you see corpus data, remember the Three C’s:
1. Count: How often does the word appear? (Frequency)
2. Company: What words are near it? (Collocation)
3. Context: What is the text about and who is it for? (Syllabus links)

Common Mistakes to Avoid

1. Just listing numbers: Don't just say "The word 'happy' appears 10 times." Explain why that matters. Does it create a positive tone?
2. Ignoring "Function Words": While "the" and "is" are usually boring, a very high frequency of "I" and "me" suggests the text is First-Person and Subjective.
3. Forgetting the "Human" element: Computers generate tables, but humans write texts. Always ask: What was the writer's Purpose?

Quick Review Box:

- Corpus: A big digital bank of language.
- Frequency: How often a word appears.
- Collocation: Words that like to hang out together.
- N-Gram: A sequence of words (clusters).
- Analysis: Link the data to Tone, Bias, Audience, and Purpose.

You've got this! Analyzing word tables is just a different way of reading. Instead of reading between the lines, you are reading between the numbers!
