Let’s say that we want to send a message to a friend. What’s the smallest amount of information we need to send that message? The answer to this question is related to the entropy of the alphabet we use to write the message in. This post will cover exactly how to do that.

First some definitions. A letter is a symbol, which we denote $a_i$, used to write messages. The subscript runs over all the letters in the alphabet being used, $\mathcal A = \{a_i\}_{i=1}^N$. The size of the alphabet we use is $N$ letters long. A message consists of a string of letters, $\mathcal M = a_{i_1}a_{i_2}\cdots a_{i_M}$. The size of the message is $M$ letters. To convey the message we need to store each letter and send it, so the way in which we store letters will determine the amount of storage we need to send the message.

Let’s first think through what we mean by storage. Since the alphabet has $N$ letters, each letter can be stored in either a single N-it, or in $\log N$ bits. Think of the latter as the number of yes/no questions one needs to answer to pinpoint which element of the alphabet a particular letter is (If the logarithm is fractional then we need to round up). Hence the entire message needs $M \log N$ bits to be sent. Is this the best we can do? Obviously no, otherwise this post wouldn’t exist.

The key to compressing the message is to use the entropy of the alphabet. The lower the entropy of the alphabet, the more compression we can achieve. The key is to store the letters of the alphabet so that the most common letters use up less storage (fewer bits). This is at the expense of rare letters using more space. This means that certain messages, when restored under this new scheme, will actually use up MORE storage. However, these messages are rare, and, on average, the majority of messages will take up less storage.

So, how is this new storage scheme accomplished?

First we need to find the probability of appearance of each letter, $p_i = p(a_i)$. This can be done by looking at a corpus of messages that have already been sent to see which letters are used more often than others, or through the use of an epistemic agent that starts off ignorant and observes messages, refining its belief distribution for the alphabet. Once the probabilities are known, the information content in each letter is:

The ceiling of this value is the number of bits entropic compression requires to store that particular letter. Hence, each letter can be mapped to a code word that is that many bits long. So what is the average number of bits needed to store a particular letter?

Voila! The entropy of the alphabet coincides with the average number of bits needed to store a letter. Since the entropy is maximized for an equiprobable alphabet at $S^{max} = \log N$, this scheme is will use no more space than the original. To see this we consider long messages. The total information needed to transmit them will be $M S[\mathcal A] \leq M\log_2 N$. Now, there is a small caveat here, since we did use the ceiling function, but we won’t worry about that.

So what does this mean for messages sent in English? Assume an alphabet of only 26 letters (don’t distinguish between upper and lower case and no spaces). The average number of bits needed to convey a letter is $\log_2 26\approx 4.70$. But we aren’t writing just any messages, we’re writing messages in English. I downloaded the 11th edition of Encyclopaedia Britannica, which gave me a dataset of over 40,350,587 words. Let’s take some time to discuss it, first starting with letters, and then moving up to clusters of letters and words.

Parsing up all those words into their constituent letters gave me 192,439,482 instances of letters in the English language. How often does each letter occur?As you can see ‘E’ is the winner, and vowels show up pretty often. Q’s and Z’s do not. We can calculate the entropy of English based of of this letter distribution, and find that the average information needed to store a letter is $S[\mathcal A] \approx 4.17$. Hence this scheme can compress messages by about 11%. Now to actually implement this scheme requires figuring out the code words to map each of the letters to. This can be accomplished quite elegantly using a Huffman tree, which also gives entropic compression its standard name: Huffman coding.

It should also be mentioned that this compression scheme also provides the maximum compression that can be accomplished based on the constraint imposed by letter frequencies on which we trained the epistemic agent whose beliefs we’re using. This is because the maximized entropy under the constraint is the most ignorant belief function allowed for: lowering the entropy implies that the epistemic agent is not maximally ignorant. But if the EA is not maximally ignorant beyond the information given to it by the constraint, that means that there must exist some other information (constraint) that it is using to constrain its beliefs. Since, by assumption, the frequency constraint is all the information the EA has to work with, a lower entropy implies a contradiction.

Now, this compression is great, but can we add more constraints from the English language to make it even better? What sort of constraint should we add?

If you are given a word to guess the first few letters might take around 4-5 tries. That’s what the entropy means! You’ll need to ask on average 4.17 questions to get the right letter. However, once you have the first one or two letters you’ll be able to guess the remaining letters pretty quickly. If you see a ‘q’ you’re going to guess ‘u’  is the next letter and you’ll most likely be right. That’s because English has much more structure than just the frequency of letters. Some letters tend to follow other letters more often than others. This following preference can help us construct an even better compression algorithm!

Once again I took the Encyclopaedia Britannica and did an analysis of which letters follow each other. I got the frequencies for all pairs of letters in words, and then calculated the conditional probabilities via Bayes’ Rule:

The conditional probabilities are referred to as 1-transitions, since they tell you what the probability is of transitioning to a new letter given that you have 1 letter of knowledge. The results are pretty neat:

The scale is logarithmic, so the colors go in powers of 2, and the sum across rows is normalized. The difference between red and blue is 16, so red is over 64000(2^16) times as likely as blue. Note the patterns in this plot. Consonants love to be followed by vowels. ‘B’ loves to be followed by ‘Y’ and ‘E’. ‘Q’ is almost always followed by ‘U’. There are some weird ones in there too. Why is ‘Q’ sometimes followed by ‘V’? That’s probably due to the fact that the Encyclopedia’s digital version hasn’t been finished being edited. The original was scanned and turned into letters based on the scanned shapes. In older books (and this one is from 1911) had u’s printed as v’s.

Neat, so now we have transition probabilities. How picky are the transitions for different letters? That can be answered by looking at the entropy of each of these distributions

The dashed green line is the entropy of the alphabet. Notice that all the transition entropies are lower, meaning that if you know a letter, it takes less guesses to find the next letter. For letters on the right side of the plot, much fewer guesses are needed. The letters on the left side are less picky about followers, and more guesses are needed. We can perform a Huffman encoding on each of these distributions, and further compress our English messages. As long as the decoder has this list of lists of codewords they will be able to decipher what we send them.

So, how much have we gained by looking at next letters?

The answer has to do with the conditional entropy of the joint distribution of two letters. Put plainly, the entropy of two letters minus the entropy of the first letter gives us the average number of questions we need to ask to guess the second letter if we know the first (Translation: (how many questions do you need to guess the pair of letters) – (how many questions you need to guess the first letter)):

Calculating this entropy we get $S[\mathcal A|\mathcal A] \approx 3.29$, or a 30% compression from the original. We’ve almost doubled our compression by considering which letters are more likely to follow which letters.

But why stop there?

Why not look at the probability that a letter follows a pair of letters: the 2-transition probabilities? Or the 3-transition probabilities, etc. Doing this for the data set, I found that English has  $S[\mathcal A|\mathcal A\times\mathcal A] \approx 2.82$, $S[\mathcal A|\mathcal A\times\mathcal A\times\mathcal A] \approx 2.03$, $S[\mathcal A|\mathcal A\times\mathcal A\times\mathcal A\times\mathcal A] \approx 1.19$. If we plot the entropies based on the number of constraints we’re using, we find something amazing:

The line at an entropy of 1 corresponds to one wrong guess, on average, to get the next letter right. We clearly see a linear trend in the entropies approaching zero. If the trend continues then we expect perfect guessing of what the next letter will be somewhere around 6 priors. So what does this mean about compression? Well, it means we need to spend the most bits (on average) on the first letter of each word in a message, and fewer bits on each consecutive letter in that word. We shouldn’t even send the letters beyond the 7th in long words because they’ll be easily filled in by the decoder(most of the time). Recall that with Huffman encoding we were able to reduce the size of a word in a message to  $MH \approx M\times 4.17$ bits. What is the average storage we need now?

To answer that we really need the word distribution of English: How many words of each length appear in the language? There’s a nuance here, in that we are interested in the usage of the language, and not the corpus of words that make up the language. The latter is over all possible unique words, whereas the former includes repetition. We can plot both distributions:

This is a cool diagram because it tells us that the distribution of used words pulls the average letter count per word down quite a bit. Treating words as unique entities, the mean word length of English is 8.37 letters per word, and 50% of words have less than 7.46 letters in them. Treating words as non-unique elements of messages the mean word length is 4.78 letters per word, and 50% of words have less than 3.55 letters in them. The distributions are equal at 5.19 letters per word, meaning that English words with more than 5 letters are underused, while words with 5 or less letters are overused.

Since we’re dealing with how English is being used to make messages, we should use the blue distribution to calculate the average entropy of a word. We do this by averaging the entropy needed to store a words of length l over all possible word lengths:

The result is pretty amazing.

How amazing?

Consider the original scheme of storing each letter of the alphabet using the same amount of storage. The average word entropy would be about 22.42 bits. With entropic encoding, the entropy we have as a result of the above is 11.44 bits. We’ve achieved a compression of about 50%, which basically allows us to send twice as many messages as before. Pretty neat, huh?