Huffman Codes

Huffman coding is a way of transmitting a message with fewer bits than you would naively assume it would take. For a lower-case English message, there are 26 characters (plus the space character). If these occured with equal frequency and with no correlations to earlier or later characters in a message, the entropy per character would be $H=\log_2 27 \approx 4.75$. English characters do not have equal frequency though (as you know if you've ever played scrabble). In 1951, David Huffman was given a problem in a graduate course to determine an efficient coding system for messages, and he came up with a tool we now call Huffman Coding, based on the frequency of different characters occuring in an average message. For an English message, characters are represented with the following strings of binary digits in Huffman's example.

Character	Code	Bits	Character	Code	Bits
space	111	3
a	1011	4	n	0111	4
b	011000	6	o	1001	4
c	00001	5	p	101000	6
d	01101	5	q	11000010101	11
e	010	3	r	0001	4
f	110011	6	s	0011	4
g	011001	6	t	1101	4
h	0010	4	u	00000	5
i	1000	4	v	1100000	7
j	1100001011	10	w	110001	6
k	11000011	8	x	110000100	10
l	10101	5	y	101001	6
m	110010	6	z	11000010100	11

I've written a little javascript code that will take an input (make sure it doesn't have punctuation) and convert it into a Huffman-coded bitstring. You should be able to play around with different inputs to see cases where Huffman coding does well, and where it does poorly. Enter something and then click

Naive Number of Bits: 0

Huffman Coded Bitstring:

Huffman Coded Number of Bits: 0