Claude Shannon gave birth to the field of information theory in his seminal paper A Mathematical Theory of Communication. In it, Shannon develops a theory of how data is transferred from a source to a receiver over a communication channel. Information in the theory rests in how meaning is conveyed through symbols, making some combinations of symbols appear more often than others. This allowed him to apply the frequentist interpretation of probability to word distributions, and resulted in a rich theory centered around the information measure that mathematical history will forever remember as Shannon entropy:

$H(X) = -\sum_i p_i \log p_i$

We will eventually get to this measure, but by a different path. Already we have discussed the limitations of the frequentist interpretation of probability, and I have stressed a more general interpretation from the school of Bayes. Shannon’s work is lacking an epistemic touch, and information seems precisely like the thing that we, as thinking beings, should be able to understand through epistemic considerations.

Last time we showed that, under very sensible assumptions, the notion of belief can be quantified in a manner that makes it equivalent to probability theory. This foundation for the Bayesian interpretation of probability theory is the connection between epistemology and information that this post will be about.

Given a Realm of Discourse (RoD) and an epistemic agent (EA), consider the act of the EA interacting with data, i.e. having an experience. If the experience changes the EA, then the experience is said to be informative, and contains information. The thing that changes about the EA is its state: its belief function on the RoD. But how does it do that?

Prior to the experience, the EA has a set of beliefs: the prior belief. After the experience the EA has a new set of beliefs: the posterior belief. What distinguishes the latter from the former is the incorporation of the experience into the EA’s history. How does the posterior differ from the prior at a quantitative level?

This is where the rules of probability come in. Consider an arbitrary statement, $s$, and a statement describing an experience, $e$. The belief in their conjunction can be decomposed via the product rule in two different ways: $b(s\land e|H)=b(s|H)b(e|s\land H)=b(e|H)b(s|e\land H)$. In the former decomposition we can identify the prior, and in the latter the posterior. Rearranging to solve for the posterior gives Bayes’ rule:

$b(s|e\land H) = \frac{b(e|s\land H)}{b(e|H)}\,b(s|H) = \mathcal L(s,e|H)\,b(s|H)$

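We have introduced the likelihood function, $\mathcal L$, which transforms the prior into the posterior through simple multiplication of real numbers. Note that the likelihood is a function of both the EA, through its dependence on history, and the data (experience). Let’s take a closer look at it. We rewrite the denominator as a sum of unmarginalized beliefs:

$\mathcal L(s,e|H) = \frac{b(e|s\land H)}{b(e|H)} = \frac{b(e|s\land H)}{\sum_{s'} b(s'|H)\,b(e|s'\land H)}$

where the sum runs over a set of mutually exclusive and exhaustive statements $s'$ in the RoD.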

What does the likelihood mean, and why is it important? We answer the latter by noting that it is that which changes belief, and hence if we are searching for what information is, then here is where we will find it. The numerator of the likelihood is the belief in $e$ under the assumption that $s$ is accepted into the history of the EA. This is where the likelihood gains its name: the numerator is the likely belief in the experience given knowledge of the statement. The denominator is a weighted average of the belief in the experience given knowledge of every statement, with the prior beliefs as weights. Because this is a weighted sum rather than just a sum, the likelihood is NOT normalized. Since the prior and the posterior ARE normalized, the likelihood in a sense describes a flow of belief in the RoD, causing the belief in some statements to wane and others to wax. The likelihood constrains this strengthening and weakening of belief to ensure the normalization of belief post experience, and hence keeps beliefs comparable throughout the process.
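A minimal numerical sketch of this update may help. It assumes a hypothetical RoD made of three mutually exclusive, exhaustive statements about a coin, with made-up prior beliefs; the names `likelihood`, `update`, `prior`, and `heads_given` are illustrative only and do not come from the derivation itself.

```python
import math

def likelihood(prior, evidence_given):
    """Return L(s, e | H) for every statement s.

    prior[s]          : b(s | H), over exclusive and exhaustive statements
    evidence_given[s] : b(e | s and H)
    """
    # Denominator b(e | H): prior-weighted sum over all statements.
    b_e = sum(prior[s] * evidence_given[s] for s in prior)
    return {s: evidence_given[s] / b_e for s in prior}

def update(prior, evidence_given):
    """Multiply the prior by the likelihood to obtain the posterior."""
    L = likelihood(prior, evidence_given)
    return {s: L[s] * prior[s] for s in prior}

# Hypothetical prior beliefs about a coin (assumed values).
prior = {"fair": 0.5, "heads-biased": 0.25, "tails-biased": 0.25}
# Belief in the experience e = "the coin lands heads" under each statement.
heads_given = {"fair": 0.5, "heads-biased": 0.8, "tails-biased": 0.2}

L = likelihood(prior, heads_given)
posterior = update(prior, heads_given)

print("likelihood:", L)
print("posterior: ", posterior)

# The likelihood itself does not sum to 1 ...
print("sum of likelihoods:", sum(L.values()))
# ... but its prior-weighted average does, which keeps the posterior normalized.
print("prior-weighted mean of L is 1:",
      math.isclose(sum(prior[s] * L[s] for s in prior), 1.0))
print("posterior is normalized:",
      math.isclose(sum(posterior.values()), 1.0))
```

In this toy case belief flows away from "tails-biased" (likelihood below 1) and toward "heads-biased" (likelihood above 1), while "fair" is left unchanged because its likelihood is 1.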

Now what of information? If it is that which changes belief, and multiplication via the likelihood updates prior to posterior, then information must be related to the likelihood. For an EA, then, we can write that the information contained in an experience about a statement is $I_s(e) = h[\mathcal L(s,e|H)]$, where the likelihood is for the experience in the context of the EA’s history. We posit continuity for $h$ without justification. We wish to explore this relationship, but to do so we really need to think a little harder about how several experiences alter belief in a stepwise fashion. Consider an EA that has two experiences, $e_1\land e_2$. Taken simultaneously as a compound experience we would have:

$b(s|e_1\land e_2\land H) = \mathcal L(s,e_1\land e_2|H)\,b(s|H)$

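If we instead decompose the experiences individually, updating our prior twice, we find:

$b(s|e_1\land e_2\land H) = \mathcal L(s,e_2|e_1\land H)\,b(s|e_1\land H) = \mathcal L(s,e_2|e_1\land H)\,\mathcal L(s,e_1|H)\,b(s|H)$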

This tells us how the likelihood for a compound experience can be decomposed into consecutive likelihoods. Note that experiences carry an implicit timestamp, so changing the order of experiences inherently changes the experiences themselves; order does matter. We look now at the idealized case where the two experiences are independent of one another; by this we mean that they are close enough to one another in time that their timestamps can be switched without altering their nature, resulting in the order independence of the compound event. In this idealization the likelihood factors as:

$\mathcal L(s,e_1\land e_2|H) = \mathcal L(s,e_1|H)\,\mathcal L(s,e_2|H)$

The independence of the experiences means that the information gained about a particular statement from one should be independent of the information gained from the other, so the two simply add: $I_s(e_1\land e_2) = I_s(e_1)+I_s(e_2)$. Denoting $x_1=\mathcal L(s,e_1|H)$ and $x_2 = \mathcal L(s,e_2|H)$, independence results in the functional equation:

$h(x_1x_2) = h(x_1)+h(x_2)$

We will now show that this equation has a unique solution up to a multiplicative constant, which will give us the functional form relating belief with information. First consider $x_1 = x_2 = x$. One can easily see that for any positive integer $n\in \mathbb Z^+$ we have $h(x^n) = h(x)+\cdots + h(x) = nh(x)$, which we will refer to as the power property. It must hold for $x=1$, where it reads $h(1)=nh(1)$ for every $n$, so it must be the case that $h(1)=0$. This property has a simple interpretation. Recall that a likelihood of one leaves the posterior equal to the prior. It does not change belief. Such an experience contains no information. Next note that $0 = h(1) = h(xx^{-1})=h(x)+h(x^{-1})$, so that $h(x^{-1})=-h(x)$. This extends the power property to negative integers. We can extend it to the rationals by writing $x=y^m$ for a positive integer $m$: the power property gives $h(x)=h(y^m)=mh(y)$, so $h(x^\frac{1}{m})=\frac{1}{m}h(x)$, and therefore $h(x^\frac{n}{m})=nh(x^\frac{1}{m})=\frac{n}{m}h(x)$. Going from the rationals to the reals simply involves invoking the denseness of the rationals in the reals and the continuity of $h$. Since the power property now applies to all real exponents, we can differentiate it with respect to the exponent (taking $h$ to be differentiable): $h(x)=h'(x^n)\cdot x^n\log x$. Multiplying by $n$, absorbing it into the left-hand side via $nh(x)=h(x^n)$, and writing $y=x^n$, we get $h(y)=h'(y)\,y\log y$. Changing variables to $u = \log y$ turns this into $\frac{dh}{h}=\frac{du}{u}$, and integrating we find $h(x) = C\log x$, where $C$ is a constant of integration. Thus information takes the form:

$I_s(e) = C\log \mathcal L(s,e|H)$

The constant will be dealt with shortly, but note that it can be partially absorbed into the base of the logarithm, allowing us to express information in whichever base we find convenient. Typical choices include 2 (the unit being called a bit), $e$ (nats), 3 (trits), or 10 (dits). The remaining freedom has to do with the sign of the constant.
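The stepwise decomposition of likelihoods and the properties of $h$ can be checked numerically. The following is only a sketch under stated assumptions: a hypothetical two-statement RoD, two coin flips taken to be conditionally independent given each statement, and the arbitrary choices $C=1$ and base 2. The function names `likelihood` and `info` are illustrative.

```python
import math

def likelihood(prior, evidence_given):
    """L(s, e | H) = b(e | s and H) / sum_s' b(s' | H) b(e | s' and H)."""
    b_e = sum(prior[s] * evidence_given[s] for s in prior)
    return {s: evidence_given[s] / b_e for s in prior}

def info(x, C=1.0, base=2):
    """h(x) = C * log_base(x); C and the base are free choices here."""
    return C * math.log(x, base)

# Hypothetical prior and beliefs in "heads" under each statement (assumed).
prior = {"fair": 0.5, "heads-biased": 0.5}
heads_given = {"fair": 0.5, "heads-biased": 0.8}

# First experience e1 (heads), then e2 (heads again) against the updated history.
L1 = likelihood(prior, heads_given)
after_e1 = {s: L1[s] * prior[s] for s in prior}
L2 = likelihood(after_e1, heads_given)

# Compound experience e1 and e2 taken at once (flips assumed conditionally
# independent given the statement, so the beliefs in e1 and e2 multiply).
both_given = {s: heads_given[s] ** 2 for s in prior}
L12 = likelihood(prior, both_given)

for s in prior:
    # Compound likelihood equals the product of consecutive likelihoods ...
    print(s, math.isclose(L12[s], L1[s] * L2[s]))
    # ... so the logarithmic form of h turns that product into a sum.
    print(s, math.isclose(info(L12[s]), info(L1[s]) + info(L2[s])))

x = 1.6  # an arbitrary positive likelihood value
print(info(1.0) == 0.0)                              # h(1) = 0
print(math.isclose(info(1 / x), -info(x)))           # h(1/x) = -h(x)
print(math.isclose(info(x ** 2.5), 2.5 * info(x)))   # power property, real exponent

# Changing the base only rescales the constant: nats = bits * ln 2.
print(math.isclose(info(x, base=math.e), info(x, base=2) * math.log(2)))
```

Every line printed should be `True`; the last check illustrates why the magnitude of the constant can be traded for a choice of base.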

The closer to 1 the likelihood is, the less information there is about the statement in the experience. This makes intuitive sense, since likelihoods close to 1 hardly change prior belief. Likelihoods less than 1 decrease the prior and have negative logarithms, while those greater than 1 increase it and have positive logarithms. Hence the sign of the constant of integration depends on properly analyzing the meaning of increasing and decreasing belief. Consider then writing the likelihood in terms of the posterior and prior:

$I_s(e) = C\log\frac{b(s|e\land H)}{b(s|H)} = C\log b(s|e\land H) - C\log b(s|H)$

This is interesting. The information contained in the likelihood can be written as a difference between the logarithms of the posterior and the prior. We now define self-information, $i(s|H) = C\log b(s|H)$. The information contained in an experience is then the difference between posterior self-information and prior self-information. It is natural to assume that self-information must be non-negative, which pins down the sign of the constant of integration as negative (since beliefs, being probabilities, are bounded from above by 1). With the magnitude of the constant absorbed into the base of the logarithm:

$i(s|H) = -\log b(s|H), \qquad I_s(e) = i(s|e\land H) - i(s|H) = -\log \mathcal L(s,e|H)$

This definition captures the essential property that information in an experience about a statement is related to how much the experience changes an EA’s belief in that statement. Though it seems we are at the end of our journey in deriving an analytical form for information, it turns out that we are not. In the next post we’ll examine the issues with the above form, and use the consistency of the Realm of Discourse to deal with them and patch things up.
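For a concrete feel of this form, here is a small sketch assuming base-2 logarithms (bits), a negative constant whose magnitude has been absorbed into the base, and made-up prior and posterior beliefs. The function names are hypothetical.

```python
import math

def self_information(belief):
    """i(s | H) = -log2 b(s | H), non-negative for 0 < belief <= 1."""
    return -math.log2(belief)

def information(prior_belief, posterior_belief):
    """I_s(e) = posterior self-information minus prior self-information."""
    return self_information(posterior_belief) - self_information(prior_belief)

# Hypothetical beliefs in a statement before and after an experience.
prior_b, posterior_b = 0.25, 0.5
L = posterior_b / prior_b  # likelihood that maps prior to posterior

print(self_information(prior_b))          # 2.0 bits
print(self_information(posterior_b))      # 1.0 bit
print(information(prior_b, posterior_b))  # -1.0

# The same number comes directly from the likelihood: I = -log2 L.
print(math.isclose(information(prior_b, posterior_b), -math.log2(L)))
```

Note the sign: with the constant negative, an experience that raises belief from 0.25 to 0.5 lowers the self-information by one bit, so $I_s(e)$ comes out as $-1$.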

Till then!