Assignment Description

In this lab, you will explore a different application of trees: Huffman trees, which allow for efficient lossless compression of files. There are a lot of files in this lab, but you will only be modifying huffman_tree.cpp.

The Huffman Encoding

In 1951, while taking an Information Theory class as a student at MIT, David A. Huffman and his classmates were given a choice by their professor, Robert M. Fano: they could either take the final exam or, to opt out of it, find the most efficient binary code. Huffman took the road less traveled, and the rest, as they say, is history.

Put simply, Huffman encoding takes in a text input and generates a binary code (a string of 0’s and 1’s) that represents that text. Let’s look at an example: Input message: “feed me more food”

Building the Huffman tree

Input: “feed me more food”

Step 1: Calculate the frequency of every character in the text, order the characters by increasing frequency, and store them in a queue.

r : 1 | d : 2 | f : 2 | m : 2 | o : 3 | 'SPACE' : 3 | e : 4
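Step 1 can be sketched in a few lines of C++. This is only an illustration: sortedFrequencies is a hypothetical helper, not part of the lab code, and characters with equal counts come out in std::map order, which may differ from the tie order in the table above.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Count how often each character occurs, then order the (character, count)
// pairs by increasing count. Ties keep std::map's character order.
std::vector<std::pair<char, int>> sortedFrequencies(const std::string& text) {
    std::map<char, int> counts;
    for (char c : text) {
        ++counts[c];
    }
    std::vector<std::pair<char, int>> result(counts.begin(), counts.end());
    std::stable_sort(result.begin(), result.end(),
                     [](const std::pair<char, int>& a,
                        const std::pair<char, int>& b) {
                         return a.second < b.second;
                     });
    return result;
}
```

Calling sortedFrequencies("feed me more food") yields seven entries, starting with 'r' (1 occurrence) and ending with 'e' (4 occurrences).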

Step 2: Build the tree from the bottom up. Start by taking the two least frequent characters and merging them (create a parent node for them). Store the merged characters in a new queue:

SINGLE: f : 2 | m : 2 | o : 3 | 'SPACE' : 3 | e : 4

MERGED: rd : 3

Step 3: Repeat Step 2, this time also considering the elements in the new queue. Now ‘f’ and ‘m’ are the two elements with the lowest frequency, so we merge them:

SINGLE: o : 3 | 'SPACE' : 3 | e : 4

MERGED: rd : 3 | fm : 4

Step 4: Repeat Step 3 until there are no more elements in the SINGLE queue, and only one element in the MERGED queue:

SINGLE: e : 4

MERGED: rd : 3 | fm : 4 | o+SPACE : 6

SINGLE:

MERGED: fm : 4 | o+SPACE : 6 | rde : 7

SINGLE:

MERGED: rde : 7 | fmo+SPACE : 10

SINGLE:

MERGED: rdefmo+SPACE : 17
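Steps 2 to 4 above can be sketched with two std::queue objects. The Node struct, popSmallest and buildHuffman are hypothetical names chosen for this illustration; the lab's TreeNode and function signatures are documented in the Doxygen and will differ. The sketch assumes that ties between the two queues go to the SINGLE queue and that the first node removed becomes the left child.

```cpp
#include <cassert>
#include <memory>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical node type for illustration; the lab's TreeNode differs.
struct Node {
    std::string symbols;  // characters covered by this subtree
    int freq = 0;
    std::unique_ptr<Node> left, right;
};

using NodeQueue = std::queue<std::unique_ptr<Node>>;

// Remove the smaller of the two queue fronts; on a tie, prefer SINGLE.
std::unique_ptr<Node> popSmallest(NodeQueue& single, NodeQueue& merged) {
    assert(!single.empty() || !merged.empty());
    bool takeSingle =
        merged.empty() ||
        (!single.empty() && single.front()->freq <= merged.front()->freq);
    NodeQueue& q = takeSingle ? single : merged;
    std::unique_ptr<Node> node = std::move(q.front());
    q.pop();
    return node;
}

// `sorted` must already be ordered by increasing frequency (Step 1).
std::unique_ptr<Node> buildHuffman(
    const std::vector<std::pair<char, int>>& sorted) {
    assert(!sorted.empty());
    NodeQueue single, merged;
    for (const auto& entry : sorted) {
        auto leaf = std::make_unique<Node>();
        leaf->symbols = std::string(1, entry.first);
        leaf->freq = entry.second;
        single.push(std::move(leaf));
    }
    // Merge the two smallest nodes until a single root remains.
    while (single.size() + merged.size() > 1) {
        auto a = popSmallest(single, merged);
        auto b = popSmallest(single, merged);
        auto parent = std::make_unique<Node>();
        parent->symbols = a->symbols + b->symbols;
        parent->freq = a->freq + b->freq;
        parent->left = std::move(a);
        parent->right = std::move(b);
        merged.push(std::move(parent));
    }
    return popSmallest(single, merged);
}
```

Fed the queue from Step 1 (r, d, f, m, o, SPACE, e), this sketch performs the same merges as the tables above and ends with a root of frequency 17. Because each merged node is at least as heavy as the one before it, the MERGED queue stays sorted without any extra work.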

From Text to Binary

Now that we have built our Huffman tree, it’s time to see how to encode our original message “feed me more food” into binary code.

Step 1: Label the branches of the Huffman tree with a ‘0’ or ‘1’. BE CONSISTENT: in this example we chose to label all left branches with ‘0’ and all right branches with ‘1’.

Step 2: Taking one character at a time from our message, traverse the Huffman tree to find the leaf node for that character. The binary code for the character is the string of 0’s and 1’s in the path from the root to the leaf node for that character. For example: ‘f’ has the binary code: 100

So our message “feed me more food” becomes 10000000111111010011110111001000111100110110011
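Once every character’s root-to-leaf path is known, encoding is a table lookup. The sketch below is an illustration only: encode and exampleCodes are hypothetical names, and apart from ‘f’ = 100 the codes in the table are inferred from the encoded message above rather than stated in the text, so treat them as an assumption.

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Code table inferred from the encoded message in the text (only 'f' = 100
// is stated explicitly); the other entries are an assumption.
const std::map<char, std::string>& exampleCodes() {
    static const std::map<char, std::string> codes = {
        {'e', "00"},  {'r', "010"}, {'d', "011"}, {'f', "100"},
        {'m', "101"}, {'o', "110"}, {' ', "111"}};
    return codes;
}

// Replace each character with its code; throws if a character has no code.
std::string encode(const std::string& text,
                   const std::map<char, std::string>& codes) {
    std::string bits;
    for (char c : text) {
        auto it = codes.find(c);
        if (it == codes.end()) {
            throw std::invalid_argument("character not in the Huffman tree");
        }
        bits += it->second;
    }
    return bits;
}
```

With this table, encode("feed me more food", exampleCodes()) reproduces the 47-bit string shown above. No code is a prefix of another, which is what lets the bits be written back to back without separators.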

Efficiency of Huffman Encoding

Notice that in our Huffman tree, the more frequent a character is, the closer it is to the root, and as a result the shorter its binary code is. Can you see how this will result in compressing the encoded text?
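To put a number on the savings, the encoded size is the sum of each character’s code length over the whole message. The helper below is a hypothetical sketch; the code lengths passed to it are read off the merge steps, where ‘e’ ends up two levels below the root and every other character three.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Total encoded size in bits: each character contributes the length of its
// code. Throws std::out_of_range if a character has no entry.
std::size_t encodedBits(const std::string& text,
                        const std::map<char, std::size_t>& codeLength) {
    std::size_t bits = 0;
    for (char c : text) {
        bits += codeLength.at(c);
    }
    return bits;
}
```

For “feed me more food” this gives 4 × 2 + 13 × 3 = 47 bits, compared with 17 × 8 = 136 bits when every character is stored as one byte.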

From Binary Code to Text

We can also decode strings of 0’s and 1’s into text using our Huffman tree. What word does the code 01000011 translate to?

What About the Rest of the Alphabet?

Notice that the Huffman tree we built in the example above does not contain every letter of the alphabet. We can encode our message and some other words like “door” or “deer”, but the tree won’t help us if we need to send a message containing a letter that’s not in it. For our Huffman encoding to be applicable in the real world, we need to build a Huffman tree that contains all the letters of the alphabet. That means that instead of using “feed me more food” to build our tree, we should use a text document that contains all letters of the alphabet. As a fun example, here is the Huffman tree that results when we use the text of the Declaration of Independence to build it.

Checking Out Your Code

To check out your files for this lab, run the following from your cs225 directory:

svn up
cd lab_huffman
make data

This will create a new folder in your working directory called lab_huffman and grab the data text files we will be dealing with.

Here is the Doxygen-generated list of files and their uses.

Implement buildTree() and removeSmallest()

Your first task will be to implement the buildTree() function on a HuffmanTree. This function builds a HuffmanTree based on a collection of sorted Frequency objects. Please see the Doxygen for buildTree() for details on the algorithm. You will probably also want to consult the list of constructors for TreeNodes.

You should implement removeSmallest() first as it will help you in writing buildTree()!

Tie Breaking

To facilitate grading, make sure that when building internal nodes, the left child has the smaller frequency.

In removeSmallest(), break ties by taking the front of the singleQueue!

Implement decode()

Your next task will be using an existing HuffmanTree to decode a given binary file. You should start at the root and traverse the tree using the description given in the Doxygen. Here is the Doxygen for decode().

You will probably find the Doxygen for BinaryFileReader useful here.

We’re using a standard stringstream here to build up our output. To append characters to it, use the following syntax:

ss << myChar;
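As a rough sketch of how the pieces fit together, the function below walks a tree one bit at a time and appends to a stringstream whenever it reaches a leaf. Everything here is an assumption made for illustration: DecodeNode, leaf, join and decodeBits are hypothetical names, and a std::string of ‘0’ and ‘1’ characters stands in for the lab’s BinaryFileReader, whose actual interface is described in the Doxygen.

```cpp
#include <cassert>
#include <memory>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical node type for illustration; the lab's TreeNode differs.
struct DecodeNode {
    char symbol = '\0';                       // meaningful only at leaves
    std::unique_ptr<DecodeNode> left, right;  // both null at a leaf
};

std::unique_ptr<DecodeNode> leaf(char c) {
    auto node = std::make_unique<DecodeNode>();
    node->symbol = c;
    return node;
}

std::unique_ptr<DecodeNode> join(std::unique_ptr<DecodeNode> l,
                                 std::unique_ptr<DecodeNode> r) {
    auto node = std::make_unique<DecodeNode>();
    node->left = std::move(l);
    node->right = std::move(r);
    return node;
}

// Walk from the root: '0' goes left, '1' goes right. At a leaf, append its
// character to the stream and restart from the root. Assumes the tree has
// at least two leaves and that `bits` holds only '0' and '1'.
std::string decodeBits(const DecodeNode& root, const std::string& bits) {
    std::stringstream ss;
    const DecodeNode* current = &root;
    for (char bit : bits) {
        assert(bit == '0' || bit == '1');
        current = (bit == '0') ? current->left.get() : current->right.get();
        if (current == nullptr) {
            break;  // malformed input: stepped past a leaf
        }
        if (!current->left && !current->right) {
            ss << current->symbol;
            current = &root;
        }
    }
    return ss.str();
}
```

For a three-leaf tree built with join(leaf('a'), join(leaf('b'), leaf('c'))), the codes are a = 0, b = 10 and c = 11, so decodeBits on "010110" returns "abca". Any bits left over after the last complete code are ignored.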

Implement writeTree() and readTree()

Finally, you will write a function used for writing HuffmanTrees to files in an efficient way, and a function to read this efficiently stored file-based representation of a HuffmanTree.

Here is the Doxygen for writeTree() and the Doxygen for readTree().

You will probably find the Doxygen for BinaryFileWriter useful here.
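The exact file format is specified in the Doxygen, so follow that for your implementation. To illustrate the general idea only, here is a sketch of one common preorder scheme, assumed for this example: an internal node is written as a 0 bit followed by its left and right subtrees, and a leaf as a 1 bit followed by the 8 bits of its character. SerialNode, writeSketch and readSketch are hypothetical names, and a std::string of ‘0’ and ‘1’ characters stands in for the binary file.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>

// Hypothetical node type for illustration; the lab's TreeNode differs.
struct SerialNode {
    char symbol = '\0';
    std::unique_ptr<SerialNode> left, right;
};

// Preorder: leaf -> '1' + 8 bits of the character;
// internal node -> '0', then the left subtree, then the right subtree.
// Assumes every internal node has exactly two children.
void writeSketch(const SerialNode* node, std::string& out) {
    assert(node != nullptr);
    if (!node->left && !node->right) {
        out += '1';
        out += std::bitset<8>(static_cast<unsigned char>(node->symbol))
                   .to_string();
        return;
    }
    out += '0';
    writeSketch(node->left.get(), out);
    writeSketch(node->right.get(), out);
}

// Inverse of writeSketch: consumes bits starting at `pos` and advances it.
std::unique_ptr<SerialNode> readSketch(const std::string& in,
                                       std::size_t& pos) {
    auto node = std::make_unique<SerialNode>();
    if (in.at(pos++) == '1') {
        std::bitset<8> byte(in.substr(pos, 8));
        node->symbol = static_cast<char>(byte.to_ulong());
        pos += 8;
        return node;
    }
    node->left = readSketch(in, pos);
    node->right = readSketch(in, pos);
    return node;
}
```

Under this scheme the tree needs no explicit size or end marker: the 0 and 1 prefixes tell the reader when a subtree is complete, so a tree with n leaves takes 10n - 1 bits.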

Testing Your Code!

We have provided you with a set of data files in the data directory you checked out. When you run make, two programs should be generated: encoder and decoder, with the following usages:

$ ./encoder
Usage:
	./encoder input output treefile
		input: file to be encoded
		output: encoded output
		treefile: compressed huffman tree for decoding

$ ./decoder
Usage:
	./decoder input treefile output
		input: file to be decoded
		treefile: compressed huffman tree to use for decoding
		output: decompressed file

Use your encoder to encode a file in the data directory, and then use the compressed file and the Huffman tree it built to decode it again with the decoder. If diff-ing the original file and the decoded file produces no output, your HuffmanTree should be working!

When testing, try using small files at first such as data/small.txt. Open it up and look inside. Imagine what the tree should look like, and see what’s happening when you run your code.

Now try running your code:

$ ./encoder data/small.txt output.dat treefile.tree
Printing generated HuffmanTree...
                                  ______________ 28 _____________
                   ______________/                               \______________
          ______ 11 _____                                                 ______ 17 _____
   ______/               \______                                   ______/               \______
s:5                           __ 6 __                         __ 8 __                         __ 9 __
                           __/       \__                   __/       \__                   __/       \__
                        y:3              3              l:4             i:4              4               :5
                                       /   \                                           /   \
                                    h:1     t:2                                      2       2
                                                                                    / \     / \
                                                                                 \n:1 r:1 o:1 a:1
Saving HuffmanTree to file...

Differing Output

It is possible to get a different tree than the one printed above and still pass the Catch tests. Use the provided Catch test cases to check whether your code is passing.

You can also test under catch as usual by running:

make test && ./test

Grading Information

The following files are used to grade this assignment:

All other files, including any testing files you have added, will not be used for grading.