Introduction: How Computers "Read" This Text

Have you ever wondered how a computer, which only understands binary (1s and 0s), knows that a specific pattern of electricity means the letter 'A' or a 'smiley face' emoji?
In this chapter, we explore character representation. You will learn how computers use "codebooks" called character sets to translate human language into bits and bytes. It’s like a secret language where every letter has its own unique ID number!


1. What is a Character Set?

A character set is a defined list of characters recognized by the computer hardware and software. Each character is assigned a unique code (a number). When you press a key on your keyboard, the computer doesn't see a letter; it sees that unique number represented as a bit pattern.

Analogy: Think of a character set like a restaurant menu. You might want "Pizza," but the waiter writes down "Number 14." As long as the kitchen (the computer) has the same menu, they know exactly what to make!

Key Points to Remember:

  • Every character (letter, number, or symbol) has a unique number.
  • These numbers are stored in the computer's memory as binary.
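
Try it yourself: here is a small sketch in Python (our choice of language for illustration, not something the chapter requires). The built-in functions `ord` and `chr` convert between a character and its code. The helper name `to_bits` is made up for this example.

```python
def to_bits(character, width=8):
    """Return the code of one character as a binary string of the given width."""
    return format(ord(character), f"0{width}b")

print(ord("A"))       # the code number for 'A'
print(chr(65))        # the character that has code 65
print(to_bits("A"))   # the same code written as an 8-bit pattern
```

The computer only ever stores the bit pattern; `ord` and `chr` simply look up the "menu number" in each direction.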

2. ASCII: The "Old School" Standard

ASCII (American Standard Code for Information Interchange) was one of the first major character sets. It is a 7-bit encoding system.

The Math: Because it uses 7 bits, it can represent \( 2^7 = 128 \) different characters.
This includes:

  • The English alphabet (uppercase and lowercase)
  • Numbers (0-9)
  • Common punctuation (like ! , . ?)
  • Control characters (invisible commands like "Enter" or "Delete")

Did you know? Even though ASCII is 7-bit, it is usually stored in an 8-bit byte, with the 8th bit left as a 0 or used for error checking.

The Problem with ASCII: It’s very limited! With only 128 codes, there isn't enough room for characters from other languages (like Arabic, Chinese, or Greek), let alone emojis.
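
The 128-code limit can be checked with a short Python sketch. The names `ASCII_BITS`, `ascii_size`, and `fits_in_ascii` are invented for this illustration.

```python
ASCII_BITS = 7
ascii_size = 2 ** ASCII_BITS   # how many different 7-bit patterns exist

def fits_in_ascii(character):
    """True if the character's code is in the 7-bit range 0 to 127."""
    return ord(character) < ascii_size

print(ascii_size)                # 128
print(fits_in_ascii("A"))        # an English letter fits
print(fits_in_ascii("\u00e9"))   # 'é' has a code above 127, so it does not fit
```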


3. Unicode and UTF-8: The Global Language

To overcome the limits of ASCII, Unicode was introduced. It aims to give every character from every language in the world its own unique code.

UTF-8

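UTF-8 is the most widely used encoding of Unicode on the internet. Unicode is the character set (the list of characters and their code numbers); UTF-8 is a way of storing those code numbers as bytes. Here is what makes it special: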

  • Variable-length: It can use 8, 16, 24, or 32 bits to represent a character.
  • Backwards Compatible: This is a very important term! It means that the first 128 codes in UTF-8 are exactly the same as the original ASCII codes.
  • Efficiency: It uses only 8 bits per character for English text, but can expand to use more bits for complex symbols or other languages.
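
The three points above can be seen in a short Python sketch. The helper name `utf8_bits` is made up for this example, and the four sample characters are our own picks.

```python
def utf8_bits(character):
    """Number of bits UTF-8 uses to store one character."""
    return len(character.encode("utf-8")) * 8

# Sample characters: A, é, €, and a smiley face emoji
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    # Print the Unicode code (in hexadecimal) and the bits UTF-8 needs
    print(f"U+{ord(ch):04X}", utf8_bits(ch))

# Backwards compatible: an ASCII character is stored as the same byte in UTF-8
print("A".encode("utf-8") == "A".encode("ascii"))
```

The loop prints 8, 16, 24, and 32 bits for the four characters, and the final line prints `True`.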

Memory Aid: Unicode is for the Universe (it covers everything!).


4. Working with Character Codes

In both ASCII and Unicode, characters are grouped together in "blocks." You don’t need to memorize the whole table, but you should know where the main groups start:

  • Numeric digits (0-9): Start at code 48.
  • Uppercase English letters (A-Z): Start at code 65.
  • Lowercase English letters (a-z): Start at code 97.

Step-by-Step: How to find a code

Don't worry if this seems like math; it's just simple counting!
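Within each group the characters are stored in order, so each letter's code is one more than the letter before it. If the exam tells you 'A' is code 65, and asks you for the code of 'D', count forward: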
1. A = 65
2. B = 66
3. C = 67
4. D = 68
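
The same counting can be sketched in Python. The function name `code_from_known` is invented for this illustration, and it assumes both letters are uppercase English letters.

```python
import string

def code_from_known(known_char, known_code, target_char):
    """Count forward (or backward) through the alphabet from a known code."""
    alphabet = string.ascii_uppercase   # "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    steps = alphabet.index(target_char) - alphabet.index(known_char)
    return known_code + steps

print(code_from_known("A", 65, "D"))   # 68
```

'D' is three places after 'A', so the answer is 65 + 3 = 68, which matches what Python's built-in `ord("D")` reports.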

Common Mistake to Avoid: Confusing uppercase and lowercase. 'A' (65) and 'a' (97) have different codes! To a computer, they are completely different pieces of data.
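
A quick Python check of this point (the variable name `case_gap` is ours):

```python
print(ord("A"), ord("a"))   # 65 97
print("A" == "a")           # False: different codes, so different data

case_gap = ord("a") - ord("A")
print(case_gap)             # 32: each lowercase letter is 32 above its uppercase partner
```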


5. Character Codes vs. Pure Binary

This is a tricky concept for many students, but it’s very important. The way a computer stores the number 6 is different from how it stores the character '6'.

  • Pure Binary: If you want to store the value six as a number (to do math with it), the computer stores it as \( 110_2 \) (padded with leading 0s to fill the space available, for example \( 00000110_2 \) in one byte).
  • Character Code: If you want to display the symbol '6' on a screen, the computer uses the ASCII/UTF-8 code. The code for the symbol '6' is 54 (which is \( 0110110_2 \) in 7-bit ASCII).

Quick Review Box:
Number 6: Binary \( 110 \)
Character '6' (ASCII): Binary \( 0110110 \)
Character '6' (UTF-8): Binary \( 00110110 \) (UTF-8 uses 8 bits for this)
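
The review box can be reproduced with a short Python sketch. The variable names are chosen for this illustration only.

```python
value = 6       # the number six, stored as a value for math
symbol = "6"    # the character '6', stored as a character code

pure_binary = format(value, "b")                     # "110"
ascii_bits = format(ord(symbol), "07b")              # "0110110" (code 54 in 7 bits)
utf8_byte = format(symbol.encode("utf-8")[0], "08b") # "00110110" (code 54 in 8 bits)

print(pure_binary, ascii_bits, utf8_byte)

# To do math with a digit character, convert it to its value first
digit_value = ord(symbol) - ord("0")   # 54 - 48 = 6
print(digit_value + value)             # 12
```

Subtracting the code of '0' (48) works because the digit characters are stored in order, just like the letters.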


Summary Checklist: Key Takeaways

- Character Set: A list of characters and their unique numerical codes.
- ASCII: 7-bit, 128 characters, only for English/basic symbols.
- Unicode (UTF-8): Variable-length, works globally, backwards compatible with ASCII.
- Groups: Digits start at 48, 'A' starts at 65, 'a' starts at 97.
- Characters vs Numbers: Storing a symbol is different from storing a value for math.