Huffman coding is a way of transmitting a message with fewer bits than a naive encoding would take. A lower-case English message uses 26 characters plus the space, for 27 symbols in all. If these occurred with equal frequency, and with no correlations to earlier or later characters in the message, the entropy per character would be $H=\log_2 27 \approx 4.75$ bits. English characters do not occur with equal frequency, though (as you know if you've ever played Scrabble). In 1951, David Huffman was given a problem in a graduate course: find an efficient coding system for messages. He came up with the tool we now call Huffman coding, which assigns codewords based on how frequently each character occurs in an average message. For an English message, Huffman's example represents each character with the following strings of binary digits.
| Character | Code | Bits | Character | Code | Bits |
|---|---|---|---|---|---|
| space | 111 | 3 | | | |
| a | 1011 | 4 | n | 0111 | 4 |
| b | 011000 | 6 | o | 1001 | 4 |
| c | 00001 | 5 | p | 101000 | 6 |
| d | 01101 | 5 | q | 11000010101 | 11 |
| e | 010 | 3 | r | 0001 | 4 |
| f | 110011 | 6 | s | 0011 | 4 |
| g | 011001 | 6 | t | 1101 | 4 |
| h | 0010 | 4 | u | 00000 | 5 |
| i | 1000 | 4 | v | 1100000 | 7 |
| j | 1100001011 | 10 | w | 110001 | 6 |
| k | 11000011 | 8 | x | 110000100 | 10 |
| l | 10101 | 5 | y | 101001 | 6 |
| m | 110010 | 6 | z | 11000010100 | 11 |
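To make the table concrete, here is a small sketch (not the code behind this page's demo) that encodes a string using a handful of entries from the table above:

```javascript
// A few codewords from Huffman's table above.
const code = {
  " ": "111", "a": "1011", "e": "010", "h": "0010",
  "t": "1101", "n": "0111", "s": "0011", "i": "1000",
};

// Encode a message by concatenating each character's codeword.
function encode(message) {
  return [...message].map(ch => code[ch]).join("");
}
```

For example, `encode("the")` concatenates 1101, 0010, and 010 into an 11-bit string, versus the 15 bits a fixed 5-bits-per-character encoding would need.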
I've written a little JavaScript widget that takes an input (make sure it doesn't have punctuation) and converts it into a Huffman-coded bitstring. You should be able to play around with different inputs to see cases where Huffman coding does well, and where it does poorly. Enter something below, then click the button to encode it.
*(Interactive demo: shows the naive number of bits, the Huffman-coded bitstring, and the Huffman-coded number of bits.)*
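The widget's source isn't shown here, but the underlying algorithm is simple enough to sketch. Assuming we build the code from the input's own character frequencies (rather than Huffman's fixed English table above), the construction looks like this:

```javascript
// Build a Huffman code from a message's own character frequencies.
// A sketch of the general algorithm, not the code behind this page's demo.
function huffmanCode(message) {
  // Count how often each character occurs.
  const freq = new Map();
  for (const ch of message) freq.set(ch, (freq.get(ch) || 0) + 1);

  // Start with one leaf node per character.
  let nodes = [...freq].map(([ch, f]) => ({ ch, f }));

  // Repeatedly merge the two least frequent nodes under a new parent.
  while (nodes.length > 1) {
    nodes.sort((a, b) => a.f - b.f);
    const [lo, hi] = nodes.splice(0, 2);
    nodes.push({ left: lo, right: hi, f: lo.f + hi.f });
  }

  // Walk the tree: left edges append "0", right edges append "1".
  const codes = {};
  (function walk(node, prefix) {
    if (node.ch !== undefined) codes[node.ch] = prefix || "0"; // single-symbol edge case
    else { walk(node.left, prefix + "0"); walk(node.right, prefix + "1"); }
  })(nodes[0], "");
  return codes;
}
```

Because every character sits at a leaf of the tree, no codeword is a prefix of another, which is what lets the bitstring be decoded unambiguously.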