Introduction to Data Compression

Have you ever tried to send a high-quality video to a friend, only to find it takes forever to upload? Or maybe your phone warned you that you're running out of storage space? This is where data compression comes to the rescue! In this chapter, we will learn how computers make files smaller so they take up less room and travel faster across the internet. Don't worry if it sounds technical—at its heart, compression is just about finding clever ways to pack information more efficiently.

What is Data Compression?

Data compression is the process of reducing the size of a file. By making files smaller, we can store more of them on our hard drives and send them over networks (like the internet) much faster.

Why do we need it?

There are two main reasons why compression is essential in Computer Science:
1. Storage Space: Smaller files mean you can fit more photos, songs, and apps on your device.
2. Transmission Speed: Smaller files take less time to download or upload. This is vital for streaming services like Netflix or YouTube so the video doesn't "buffer" constantly.

Quick Review: Compression = Smaller Files = More Space + Faster Sharing!

Huffman Coding

Huffman coding is a clever way to compress text or data by looking at how often certain characters appear. In standard ASCII, every character (like 'A', 'e', or '!') uses the same number of bits: 7 bits in the original standard, though computers usually store each character in 8 bits (one byte). Huffman coding takes advantage of the fact that this is wasteful. Why give a rare letter like 'Z' the same amount of space as a very common letter like 'E'?

The Big Idea

Imagine you are packing a bag. You put your most used items (like your phone) in an easy-to-reach pocket, while things you rarely use (like an umbrella) go at the very bottom. Huffman coding does the same with bits:
- Frequent characters get short binary codes.
- Rare characters get long binary codes.

Interpreting a Huffman Tree

To find the binary code for a character, we follow a Huffman Tree from the top (the root) down to the character we want.
- Every time you go Left, you add a 0.
- Every time you go Right, you add a 1.

Example: If you follow the path Left, then Right, then Left to find the letter 'A', its Huffman code would be 010.
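The tree-tracing rule above can be sketched in a few lines of Python. The tree here is built from nested tuples (a leaf is a single character, an internal node is a (left, right) pair) and is invented purely for illustration, arranged so that 'A' sits at the end of the Left, Right, Left path:

```python
# A minimal sketch of tracing a Huffman tree built from nested tuples:
# a leaf is a single character, an internal node is a (left, right) pair.
# This particular tree is invented for illustration only.

def find_code(tree, target, path=""):
    """Walk the tree to target, adding 0 for Left and 1 for Right."""
    if tree == target:                       # reached the leaf we wanted
        return path
    if isinstance(tree, tuple):              # internal node: try both branches
        left, right = tree
        return (find_code(left, target, path + "0")
                or find_code(right, target, path + "1"))
    return None                              # a different leaf: dead end

# 'A' sits at the end of the path Left, Right, Left:
tree = (('B', ('A', 'C')), 'D')
print(find_code(tree, 'A'))  # -> 010
```

Try tracing 'D' by hand first (it is one step Right from the root), then check your answer with the function.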

Calculating Bit Savings

You might be asked to calculate how many bits are saved using Huffman coding compared to ASCII. Here is a step-by-step guide:
1. Calculate Uncompressed Size: Count the total characters in the string and multiply by the bits per character (usually 7 or 8).
2. Calculate Compressed Size: Count how many times each character appears, multiply by the length of its specific Huffman code, and add them all together.
3. Find the Difference: Subtract the compressed size from the uncompressed size.
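Here is how the three steps might look in Python. The string and the Huffman code table below are made up purely for practice (a real code table would come from an actual Huffman tree), but the table is prefix-free, as real Huffman codes must be:

```python
# Worked example of the three steps, using an invented string and an
# invented (but prefix-free) Huffman code table.
from collections import Counter

text = "BEEKEEPER"
codes = {'E': '0', 'P': '10', 'B': '110', 'K': '1110', 'R': '1111'}

# Step 1: uncompressed size (here, 8 bits per ASCII character)
uncompressed = len(text) * 8                 # 9 characters x 8 = 72 bits

# Step 2: compressed size = frequency x code length, summed over characters
freq = Counter(text)                         # e.g. 'E' appears 5 times
compressed = sum(count * len(codes[ch]) for ch, count in freq.items())

# Step 3: the difference is the bits saved
print(uncompressed, compressed, uncompressed - compressed)  # -> 72 18 54
```

Notice how the most frequent character, 'E', has the shortest code, which is exactly where most of the saving comes from.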

Did you know? Huffman coding is used inside many common file formats you use every day, including ZIP files and JPEGs!

Key Takeaway: Huffman coding saves space by giving common data shorter bit-patterns and rare data longer ones.

Run Length Encoding (RLE)

Run Length Encoding (RLE) is a simple form of compression that works best when data has lots of repeating values. It is very common in simple bitmap images where large areas are the same color.

How RLE Works

Instead of recording every single piece of data individually, RLE records the data value and the number of times (the frequency) it repeats in a row. We call these "frequency/data pairs."

Analogy: If you were telling a friend how to paint a wall, you wouldn't say "Paint a blue pixel, then paint a blue pixel, then paint a blue pixel..." You would just say "Paint 10 blue pixels."

Representing Data in RLE

Let's look at a string of binary data (the spaces are only there to make the runs easier to see):
00000 111 000000 11

In RLE, we count the "runs" of identical bits:
- There are five 0s
- There are three 1s
- There are six 0s
- There are two 1s

The RLE representation would be: 5 0 3 1 6 0 2 1
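If you like to check run-counting with code, the example above can be sketched in Python. This version leans on the standard library's itertools.groupby, which splits a sequence into runs of identical values:

```python
# A minimal sketch of RLE using frequency/data pairs, as described above.
from itertools import groupby

def rle_encode(bits):
    """Return a list of (count, value) pairs, one per run of identical bits."""
    return [(len(list(run)), value) for value, run in groupby(bits)]

data = "0000011100000011"  # the example string, without the display spaces
print(rle_encode(data))  # -> [(5, '0'), (3, '1'), (6, '0'), (2, '1')]
```

Reading the pairs left to right gives exactly the answer above: 5 0 3 1 6 0 2 1.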

Common Mistake to Avoid!

When writing RLE pairs, always make sure you know which number is the count and which is the data. Usually, the first number is the frequency (how many) and the second is the value (what it is).
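To see why the order matters, here is a small companion sketch that expands (count, value) pairs back into the original data. If you accidentally swapped count and value, you would rebuild a completely different string:

```python
# Expanding frequency/data pairs back into the original data,
# assuming each pair is in (count, value) order as described above.
def rle_decode(pairs):
    """Rebuild the original string from (count, value) pairs."""
    return "".join(value * count for count, value in pairs)

print(rle_decode([(5, '0'), (3, '1'), (6, '0'), (2, '1')]))  # -> 0000011100000011
```

Because decoding recovers the data exactly, RLE is a lossless compression method: no information is thrown away.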

Key Takeaway: RLE reduces file size by replacing repeating sequences of data with a count and a single value.

Summary of Compression Methods

Huffman Coding: Best for data where some items appear much more often than others (like the letter 'e' in English text).
Run Length Encoding (RLE): Best for data with long "runs" of repeating values (like a simple icon with a solid background color).

Don't worry if this seems tricky at first! Just remember that the goal is always the same: finding a way to describe the same information using fewer 1s and 0s. Practice a few tree-tracing exercises and RLE conversions, and you'll be a pro in no time!

Quick Review Box

- Data Compression: Making files smaller for storage and speed.
- Huffman Coding: Uses a tree; frequent characters = short codes; rare = long codes.
- RLE: Uses frequency/data pairs; best for repeating data.
- Math Tip: To find bits in ASCII, use \( \text{Number of Characters} \times 7 \) (or 8). To find bits in Huffman, add up the bits used for each individual character.