Welcome to Character Encoding! 🚀

Ever wondered how your computer knows that a specific bunch of 1s and 0s should look like the letter 'A' on your screen, or why an emoji appears when you send a text? In this chapter, we’ll explore the "secret codes" computers use to represent text. This is a fundamental part of the Data and Information section of your H2 Computing journey.

By the end of these notes, you’ll understand the differences between ASCII and Unicode and know how to use them in your own Python programs!


1. The Core Concept: How Computers "Read"

Computers are essentially giant calculators that only understand binary (0 and 1). They cannot "see" letters or symbols. To solve this, we use a character set: a giant lookup table (like a secret decoder ring) that assigns a unique number to every character. That number is then stored in binary, just like any other number.

Analogy: Imagine a library where every book has a specific number. Instead of asking for "The Great Gatsby," you just ask for "Book #402." Character encoding is the system that tells the computer that "65" means "A".
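To make the lookup idea concrete, here is a minimal Python sketch. It uses the built-in ord() function (covered in Section 2) to look up the number for a character, then format() to show that number as 8 binary digits. The variable names are only for illustration.

```python
# Look up the number assigned to a character, then show it in binary.
letter = "A"
code = ord(letter)          # the character's number in the lookup table
bits = format(code, "08b")  # the same number written as 8 binary digits

print(letter, code, bits)   # A 65 01000001
```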


2. ASCII: The Foundation

ASCII stands for American Standard Code for Information Interchange. It was one of the first widely used encoding systems.

Key Facts about ASCII:

  • It uses 7 bits to represent each character.
  • With 7 bits, it can represent \(2^7 = 128\) unique characters.
  • These 128 characters include:
    • Uppercase letters (A-Z)
    • Lowercase letters (a-z)
    • Digits (0-9)
    • Punctuation marks (., !, ?, etc.)
    • The space character
    • Non-printing control characters (such as newline, tab and backspace)
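As a rough check on the 128 figure, the sketch below counts each group using Python's standard string module. It assumes the usual ASCII layout, where codes 0 to 31 and code 127 are control characters, and it counts the space character (code 32) separately.

```python
import string

upper = len(string.ascii_uppercase)    # 26 letters A-Z
lower = len(string.ascii_lowercase)    # 26 letters a-z
digits = len(string.digits)            # 10 digits 0-9
punctuation = len(string.punctuation)  # 32 punctuation marks and symbols
space = 1                              # the space character, code 32
# Control characters: codes 0 to 31, plus code 127 (delete)
control = len([c for c in range(128) if c < 32 or c == 127])

total = upper + lower + digits + punctuation + space + control
print(total)            # 128
print(total == 2 ** 7)  # True
```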

Using ASCII in Programs (Syllabus 3.2.2)

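In Python, you can easily switch between a character and its integer code using two built-in functions. (In Python 3 these functions work with Unicode code points, which are identical to ASCII for codes 0 to 127.) Don't worry if this seems tricky at first; you just need to remember these two tools: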

1. ord(): Converts a character to its integer code.

Example: ord('A') returns 65.

2. chr(): Converts an integer code back into a character.

Example: chr(65) returns 'A'.
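Here is a short sketch of the two functions working together. The helper next_letter() is a made-up name for this example, not a built-in function.

```python
def next_letter(ch):
    """Return the character whose code is one higher than ch's code."""
    return chr(ord(ch) + 1)

print(ord("A"))            # 65
print(chr(65))             # A
print(next_letter("A"))    # B
print(chr(ord("a") - 32))  # A  (lowercase and uppercase codes differ by 32)
```

Because the letters are stored in alphabetical order, adding 1 to a letter's code gives the next letter.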

Quick Review Table:

Character: 'A' → ASCII: 65
Character: 'B' → ASCII: 66
Character: 'a' → ASCII: 97
Character: '0' → ASCII: 48

Common Mistake to Avoid: Remember that the character '0' (the text) is not the same as the number 0. In ASCII, the character '0' is represented by the number 48!
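The difference is easy to see in Python. The sketch below also shows one common trick for turning a digit character into its numeric value; digit_value() is a hypothetical helper written for this example, and in real programs you would normally just use the built-in int().

```python
def digit_value(ch):
    """Convert one digit character such as '7' into the integer 7."""
    return ord(ch) - ord("0")  # the codes for '0' to '9' are 48 to 57

print(ord("0"))          # 48
print("0" == 0)          # False: text is not the same as a number
print(digit_value("7"))  # 7
print(int("7"))          # 7  (the built-in way)
```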

Key Takeaway:

ASCII is a simple 7-bit system (often stored in 8 bits/1 byte) that covers basic English characters but is limited to only 128 possibilities.


3. Unicode: The Universal Standard

As computing went global, 128 characters weren't enough. How do we represent Mandarin, Arabic, mathematical symbols, or the "Face with Tears of Joy" emoji 😂?

Unicode was created to solve this. It is a much larger character set that aims to include every character from every language in the world.

Examples of Where Unicode is Used (Syllabus 3.2.1):

  • Modern Web Browsers: Almost every website you visit uses Unicode (specifically a format called UTF-8) so that text displays correctly in any language.
  • Operating Systems: Windows, macOS, and Linux use Unicode to handle filenames and system text.
  • Social Media and Messaging: When you send an emoji on WhatsApp or Telegram, you are using Unicode. Each emoji has a specific Unicode "code point."
  • International Software: Any app that allows you to switch languages (e.g., from English to Japanese) relies on Unicode, because one character set covers both languages and nothing has to be swapped out.

Did you know?

The first 128 characters of Unicode are exactly the same as ASCII. This makes Unicode "backward compatible": in the UTF-8 encoding, these characters are stored as the same single bytes as in ASCII, so old ASCII files can be read by modern UTF-8 systems without any changes!
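You can check this compatibility yourself. The sketch below encodes the same English text as ASCII and as UTF-8 (the Unicode format mentioned above) using Python's str.encode(), then compares the bytes. The sample text "Hello" is just an illustration.

```python
text = "Hello"

ascii_bytes = text.encode("ascii")  # store the text using ASCII
utf8_bytes = text.encode("utf-8")   # store the text using UTF-8

print(ascii_bytes == utf8_bytes)    # True: the stored bytes are identical
print(list(utf8_bytes))             # [72, 101, 108, 108, 111]
print(ascii_bytes.decode("utf-8"))  # Hello  (ASCII bytes read back as UTF-8)
```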

Key Takeaway:

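Unicode is the "big brother" of ASCII. It has room for over a million code points (1,114,112 in total), and each one is stored using an encoding such as UTF-8 (1 to 4 bytes per character), UTF-16 (2 or 4 bytes) or UTF-32 (always 4 bytes). This makes it the standard for our modern, globalized digital world.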
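To see how Unicode goes beyond ASCII's 128 codes, the sketch below prints the code point of a few sample characters and how many bytes each one needs in UTF-8. The sample characters are chosen only for illustration and are written as escape codes so the example does not depend on how this file is saved.

```python
import sys

# A, é, 中 and 😂 written as Unicode escape codes
samples = ["A", "\u00e9", "\u4e2d", "\U0001F602"]

results = []
for ch in samples:
    code_point = ord(ch)                   # the character's Unicode number
    utf8_length = len(ch.encode("utf-8"))  # bytes needed to store it in UTF-8
    results.append((hex(code_point), utf8_length))
    print(ch, hex(code_point), utf8_length)

# Prints:
# A 0x41 1
# é 0xe9 2
# 中 0x4e2d 3
# 😂 0x1f602 4

print(sys.maxunicode + 1)  # 1114112 possible code points
```

Notice that 'A' still takes just one byte, while the emoji needs four.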


4. Summary Comparison

ASCII:

  • Size: 7 bits (128 characters).
  • Scope: Basic English letters, digits and symbols only.
  • Usage: Simple legacy systems, basic programming exercises.

Unicode:

  • Size: Depends on the encoding (UTF-8 uses 8 to 32 bits per character; UTF-32 always uses 32 bits).
  • Scope: Global (All languages + Emojis).
  • Usage: The internet, modern apps, global communication.

Quick Check: Test Yourself!

1. If ord('C') is 67, what is chr(68)?
(Answer: 'D')

2. Why can't we use ASCII to write a document in Tamil or Chinese?
(Answer: ASCII only has 128 slots, which are mostly filled by English characters and symbols. It doesn't have "room" for other languages.)

3. Give one real-world example of Unicode in your daily life.
(Answer: Using emojis in a text message or viewing a website in a foreign language.)
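Want to see Question 2 in action? The sketch below tries to store two Chinese characters using ASCII. Python raises a UnicodeEncodeError because ASCII has no codes for them, while UTF-8 handles them without trouble. The sample word and variable names are only for illustration.

```python
word = "\u4e2d\u6587"  # 中文 ("Chinese"), written as Unicode escape codes

try:
    word.encode("ascii")        # ASCII has no codes for these characters
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(ascii_ok)                   # False
print(len(word.encode("utf-8")))  # 6  (3 bytes per character in UTF-8)
```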

Pro-tip for the exam: If a question asks why Unicode is preferred over ASCII, always mention "Internationalisation" or "Support for a wider range of characters/languages."