Dear reader, you know what's way more fun than feeling sad about the nature of the cosmos? Data compression, that's what! Suppose you want to send a message to your friends in a nearby alternate universe, but interuniversal communication bandwidth is very expensive (different universes can't physically interact, so we and our alternate-universe analogues can only communicate by mutually inferring what the other party must be saying, which takes monstrous amounts of computing power and is not cheap), so you need to make your message as brief as possible. Note that 'brief' doesn't just have to do with how long your message is in natural language, it also has to do with how that message is represented over the transuniveral communication channel: indeed, the more efficient the encoding, the more you can afford to say on a fixed budget.

The classic ASCII encoding scheme uses seven bits to represent each character. (Seven?—you ask perplexedly, surely you mean eight? Apparently it was seven in the original version.) Can we do better? Well ... ASCII has a lot of stuff that arguably you don't need that badly. Really, upper and lower case letters? Ampersands, asterisks, backslashes? And don't get me started about those unprintable control characters! If we restrict our message to just the uncased alphabet A through Z plus space and a few punctuation marks, then we can encode our message using only a 32 (= 25) character set, at five bits per character.

Can we do better? Seemingly not—24 = 16 isn't a big enough character set to cover the alphabet. Unless ...

Unless we abandon the assumption that each character needs to be represented by the same number of bits!

Adopting such a variable-length character encoding scheme couldn't help us if all possible messages were equally likely, but this is manifestly not true of the sort of messages anyone would actually want to send: the letter e is more common than the letter q; the word the more common than the string zkb. We can take advantage of the regularities in the sort of messages people would actually want to send by adopting a variable-length encoding scheme that assigns shorter codes to characters that occur more frequently. But then since we can no longer rely on each n bits (for some fixed n) representing one character, we'll also want our encoding scheme to be prefix-free: no character code should be a prefix of any other. That way, our message is unambiguous: if the message starts with (say) 010, which represents (say) e, then we know that the first character is e, because there won't be any other characters with codes that start with 010 (like say 01011).

Consider the process of decoding a prefix-free code: we read bits from the start of the message until we recognize a valid character code. We can visualize this graphically in the form of a binary tree with the characters in our character set at the leaves: start at the root, and then go the the left child if the first bit is 0, and the right child if it's 1, and so on until you reach a leaf, the path taken representing the code for the character at that leaf. The task of constructing such an optimal such tree (and thereby, an optimal prefix-free code) given the relative frequencies of the characters (either in a particular message, or a general class of messages that people would actually want to send) is solved by a algorithm due to David A. Huffman, which we'll implement here in Python.

First, we'll need a data structure to represent the nodes of the tree. Actually, we'll use a structure that represents a node and all of its children, which we'll call Subtree. Each instance of Subtree will have fields for a character and frequency associated with a node, and the node's left and right children if any (which are also instances of Subtree).

class Subtree:
    def __init__(self, char, freq, left, right):
        self.char = char
        self.freq = freq
        self.left = left
        self.right = right

The leaf nodes of our tree will have the character field set to a character in our set, the frequency field set to the frequency of that character, and the left and right children fields set to None. The internal nodes of the tree will have the character field set to None and the frequency field set to the sum of the frequencies of its children.

We'll also want nodes to be comparable by their frequencies:

    def __gt__(self, othernode):
        return self.freq > othernode.freq

When all is said and done, Huffman's algorithm will give us an instance of Subtree representing our tree, but it would really be more convenient to have the decoding information embodied by the tree in the form of Python dictionary mapping bitstrings to the characters they encode. We can get this by doing a standard recursive tree walk:

    def codebook(self):
        codes = {}
        def traversal(item, code):
            if item != None:
                traversal(item.left, code+'0')
                if item.char != None:
                    codes[item.char] = code
                traversal(item.right, code+'1')
        traversal(self, '')
        return codes

We're also going to need a min-priority queue, a dynamic set from which we can insert items and retrieve the "smallest":

class MinPriorityQueue:
    def __init__(self):
        self.queue = []

    def put(self, item):

    def get(self):
        return self.queue.pop()

Now that we have the appropriate data structures, we're ready for Huffman's algorithm. Suppose our character/frequency data are given to us in a Python dictionary C. For each character, we create an instance of Subtree to be the corresponding leaf node in our tree, and put them all in a min-priority queue. Then we build our tree from the bottom up by selecting the two lowest-frequency nodes in the queue, making them the children of a new node whose frequency is the sum of their frequencies, putting the new node back in the priority queue, and again until we've built the entire tree:

def Huffman(C):
    Q = MinPriorityQueue()
    leaves = {Subtree(k, C[k], None, None) for k in C}
    for leaf in leaves:
    for i in range(len(C)-1):
        left = Q.get()
        right = Q.get()
        new_node = Subtree(None, left.freq + right.freq, left, right)
    return Q.get().codebook()

And that's Huffman's algorithm. Of course, we'll also want functions for encoding and decoding a message:

def code(plaintext, codebook):
    return ''.join(codebook[c] for c in plaintext)

def decode(ciphertext, codebook):
    decodebook = {v:k for k, v in codebook.items()}
    codeword = ''
    plaintext = ''
    for i in range(len(ciphertext)):
        codeword += ciphertext[i]
        if codeword in decodebook:
            plaintext += decodebook[codeword]
            codeword = ''
    return plaintext

So, suppose you want to send your friends in a nearby alternate universe the eighty-two character message, "I USED TO WONDER WHAT FRIENDSHIP COULD BE, UNTIL YOU ALL SHARED ITS MAGIC WITH ME." If you used seven-bit ASCII, you'd have to pay the cost of transmitting 82*7 = 574 bits. But if you use a Huffman code informed by knowledge of English letter frequencies ...

eng_freqs = {'A' : 8167, 'B' : 1492, 'C' : 2782, 'D' : 4253, 'E' : 12702,
'F' : 2228, 'G' : 2015, 'H' : 6094, 'I' : 6966, 'J' : 153, 'K' : 772,
'L' : 4025, 'M' : 2406, 'N' : 6749, 'O' : 7507, 'P' : 1929, 'Q' : 95,
 'R' : 5987, 'S' : 6327, 'T' : 9056, 'U' : 2758, 'V' : 978, 'W': 2360,
'X' : 150, 'Y' : 1974, 'Z' : 74, ' ' : 13000, '.': 4250, ',': 4250}
eng_codebook = Huffman(eng_freqs)
cipher = code(plain, eng_codebook)

You can send this instead:


—which is only 353 bits, for a savings of 38.5%.


Cormen et al., Introduction to Algorithms, third ed'n, §16.3

Leave a Reply

Your email address will not be published. Required fields are marked *