# Part IIa: Probability and Information

Claude Shannon gave birth to the field of information theory in his seminal paper A Mathematical Theory of Communication. In it Shannon develops how data is transferred from a source to a receiver over a communication channel. Information in the theory rests in how meaning is conveyed through symbols, making some combinations of symbols appear more often than others. This allowed him to apply the frequentist interpretation of probability to word distributions, and resulted in a rich theory centered around the information measure that mathematical history will forever remember as Shannon entropy:

$$H = -\sum_i p_i \log p_i$$

We will eventually get to this measure, but by a different path. Already we have discussed the limitations of the frequentist interpretation of probability, and I have stressed a more general interpretation from the school of Bayes. Shannon’s work is lacking an epistemic touch, and information seems precisely like the thing that we, as thinking beings, should be able to understand through epistemic considerations.

Last time we showed that, under very sensible assumptions, the notion of belief can be quantified in a manner that makes it equivalent to probability theory. This foundation for the Bayesian interpretation of probability theory is the connection between epistemology and information that this post will be about.

Given a Realm of Discourse (RoD) and an epistemic agent (EA), consider the act of the EA interacting with data, i.e. having an experience. If the experience changes the EA, then the experience is said to be informative, and contains information. The thing that changes about the EA is its state: its belief function on the RoD. But how does it do that?

Prior to the experience, the EA has a set of beliefs: the prior belief. After the experience the EA has a new set of beliefs: the posterior belief. What distinguishes the latter from the former is the incorporation of the experience into the EA’s history. How does the posterior differ from the prior at a quantitative level?

This is where the rules of probability come in. Consider an arbitrary statement, $s$, and a statement describing an experience, $e$. The belief in their conjunction can be decomposed via Bayes’ rule in two different ways, $b(s\land e|H)=b(s|H)b(e|s\land H)=b(e|H)b(s|e\land H)$. In the former decomposition we can identify the prior, and in the latter the posterior. Rearranging to solve for the posterior we have:

$$b(s|e\land H) = b(s|H)\,\frac{b(e|s\land H)}{b(e|H)} = b(s|H)\,\mathcal L(s,e|H)$$

We have introduced the likelihood function, $\mathcal L$, which transforms the prior into the posterior through simple multiplication of real numbers. Note that the likelihood is a function of both the EA, through its dependence on history, and of the data (experience). Let’s take a closer look at it. We rewrite the denominator as a sum of unmarginalized beliefs, taken over a mutually exclusive and exhaustive set of statements $s'$:

$$\mathcal L(s,e|H) = \frac{b(e|s\land H)}{\sum_{s'} b(s'|H)\,b(e|s'\land H)}$$

What does the likelihood mean, and why is it important? We answer the latter by noting that it is that which changes belief, and hence if we are searching for what information is, then here is where we will find it. The numerator of the likelihood is the belief in $e$ under the assumption that $s$ is accepted into the history of the EA. This is where the likelihood gains its name, the numerator is the likely belief in the experience given knowledge of the statement. The denominator is a weighted average of the likeliness of belief in the experience given knowledge of every statement. Because this is a weighted sum rather than just a sum, the likelihood is NOT normalized. Since the prior and the posterior ARE normalized, the likelihood in a sense describes a flow of belief in the RoD, causing the belief in some statements to wane and others to wax. The likelihood constrains this strengthening and weakening of belief to ensure the normalization of belief post experience, and hence keep beliefs comparable throughout the process.
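The role of the likelihood as a flow of belief can be illustrated with a small numerical sketch. The two statements (`fair`, `biased`) and all the belief values below are made up for illustration; the function names are assumptions, not part of the original derivation.

```python
# A minimal sketch of a single belief update, assuming a toy Realm of
# Discourse with two mutually exclusive, exhaustive statements.

def likelihood(prior, cond):
    """Return L(s, e|H) = b(e|s,H) / b(e|H) for every statement s."""
    # Denominator: prior-weighted average of the conditional beliefs, b(e|H).
    evidence = sum(prior[s] * cond[s] for s in prior)
    return {s: cond[s] / evidence for s in prior}

def update(prior, cond):
    """Posterior = prior times likelihood, statement by statement."""
    like = likelihood(prior, cond)
    return {s: prior[s] * like[s] for s in prior}

# Hypothetical numbers: prior belief about a coin, and the belief in the
# experience "the toss came up heads" under each statement.
prior = {"fair": 0.5, "biased": 0.5}
cond = {"fair": 0.5, "biased": 0.9}

like = likelihood(prior, cond)
posterior = update(prior, cond)

print("likelihood:", like)
print("posterior:", posterior)
print("sum of likelihoods:", sum(like.values()))    # not constrained to be 1
print("sum of posterior:", sum(posterior.values()))  # stays normalized
```

Belief in `biased` waxes and belief in `fair` wanes, while the total belief remains normalized even though the likelihoods themselves do not sum to one.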

Now what of information? If it is that which changes belief, and multiplication via the likelihood updates prior to posterior, then information must be related to the likelihood. For an EA, then, we can write that the information contained in an experience about a statement is $I_s(e) = h[\mathcal L(s,e|H)]$, where the likelihood is for the experience in the context of the EA’s history. We posit continuity for $h$ without justification. We wish to explore this relationship, but to do so we really need to think a little harder on how several experiences alter belief in a stepwise fashion. Consider an EA that has two experiences, $e_1\land e_2$. Taken simultaneously as a compound experience we would have:

$$b(s|e_1\land e_2\land H) = b(s|H)\,\mathcal L(s,e_1\land e_2|H)$$

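If we instead decompose the experiences individually, updating our prior twice, we find:

$$b(s|e_1\land e_2\land H) = b(s|e_1\land H)\,\mathcal L(s,e_2|e_1\land H) = b(s|H)\,\mathcal L(s,e_1|H)\,\mathcal L(s,e_2|e_1\land H)$$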
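The agreement between a single compound update and two consecutive updates can be checked numerically. This is only a sketch: the coin statements, the belief values, and the assumption that the two tosses are independent given each statement are all illustrative choices, not part of the original argument.

```python
# A minimal sketch comparing one compound update with two stepwise updates.

def likelihood(prior, cond):
    """Return L(s, e|H) = b(e|s,H) / b(e|H) for every statement s."""
    evidence = sum(prior[s] * cond[s] for s in prior)
    return {s: cond[s] / evidence for s in prior}

def update(prior, cond):
    """Posterior = prior times likelihood."""
    like = likelihood(prior, cond)
    return {s: prior[s] * like[s] for s in prior}

prior = {"fair": 0.5, "biased": 0.5}
heads = {"fair": 0.5, "biased": 0.9}  # belief in "heads" under each statement
# Assumption: given the statement, the two tosses do not influence each other.
two_heads = {s: heads[s] ** 2 for s in heads}

# Compound experience e1 and e2 taken at once.
compound = update(prior, two_heads)
like_compound = likelihood(prior, two_heads)

# Stepwise: the posterior after e1 serves as the prior for e2.
after_first = update(prior, heads)
stepwise = update(after_first, heads)
like_first = likelihood(prior, heads)         # L(s, e1 | H)
like_second = likelihood(after_first, heads)  # L(s, e2 | e1 and H)

for s in prior:
    print(s, compound[s], stepwise[s],
          like_compound[s], like_first[s] * like_second[s])
```

Both routes land on the same posterior, and the compound likelihood equals the product of the two consecutive likelihoods.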

This is telling us how the likelihood for a compound experience can be decomposed into consecutive likelihoods. Note that the events have an implicit timestamp associated to them, so changing the order of experiences is inherently changing the experiences themselves, so order does matter. We look now at the idealized case where the two experiences are independent of one another; by this we mean that they are close enough to one another in time that their timestamps can be switched without altering their nature, resulting in the order independence of the compound event. In this idealization the likelihood factors as:

$$\mathcal L(s,e_1\land e_2|H) = \mathcal L(s,e_1|H)\,\mathcal L(s,e_2|H)$$

The independence of the experiences means that information gained about a particular statement from one should be independent of the information gained from the other: $I_s(e_1\land e_2) = I_s(e_1)+I_s(e_2)$. Denoting $x_1=\mathcal L(s,e_1|H)$ and $x_2 = \mathcal L(s,e_2|H)$, independence results in the functional equation:

$$h(x_1x_2) = h(x_1) + h(x_2)$$

We will now show that this equation has a unique solution, which will give us the functional form relating belief with information. First consider $x_1 = x_2 = x$. One can easily see that for any positive integer $n\in \mathrm Z^+$, we have that $h(x^n) = h(x)+\cdots + h(x) = nh(x)$, which we will refer to as the power property. It must hold for $x=1$, so it must be the case that $h(1)=0$. This property has a simple interpretation. Recall that a likelihood of one leaves the posterior equivalent to the prior. It does not change belief. Such an experience contains no information. Next note that $0 = h(1) = h(xx^{-1})=h(x)+h(x^{-1})$ so that $h(x^{-1})=-h(x)$. This extends the power property to negative integers. We can extend it to the rationals by first noting that $h(x) = h((x^\frac{1}{m})^m) = mh(x^\frac{1}{m})$, so that $h(x^\frac{1}{m})=\frac{1}{m}h(x)$ for any positive integer $m$. Applying the integer power property then gives $h(x^\frac{n}{m})=nh(x^\frac{1}{m})=\frac{n}{m}h(x)$. Going from the rationals to the reals simply involves invoking the denseness of the rationals in the reals and the continuity of $h$. Since the power property now applies to all real exponents, we can differentiate it with impunity with respect to the exponent: $h(x)=h'(x^n)\cdot x^n\log x$. Multiplying by $n$ and using the power property on the lhs gives $h(x^n) = h'(x^n)\cdot x^n\log x^n$, that is, $h(y) = h'(y)\,y\log y$ with $y = x^n$. Changing variables to $u = \log y$ this reads $h = u\,\frac{dh}{du}$, and integrating we find $h(y) = C\log y$, where $C$ is a constant of integration. Thus information takes the form:

$$I_s(e) = C\log \mathcal L(s,e|H)$$

The constant will be dealt with shortly, but note that it can be partially absorbed into the base of the logarithm, allowing us to express information in whichever base we choose to be convenient. Typical choices include 2 (the unit being called a bit), $e$ (nats), 3 (trits), or 10 (dits). The remaining freedom has to do with the sign of the constant.
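The logarithmic solution and the freedom in its base can be checked with a few lines of arithmetic. This is a sketch only: the helper `h` and the sample likelihood values are chosen for illustration.

```python
import math

def h(x, C=1.0, base=math.e):
    """Candidate solution h(x) = C * log_base(x) of h(x1*x2) = h(x1) + h(x2)."""
    return C * math.log(x, base)

x1, x2 = 0.8, 2.5  # arbitrary positive likelihood values

# Additivity: the functional equation itself.
additive_lhs = h(x1 * x2)
additive_rhs = h(x1) + h(x2)

# Power property for a non-integer exponent.
r = 2.75
power_lhs = h(x1 ** r)
power_rhs = r * h(x1)

# A likelihood of one carries no information.
h_one = h(1.0)

# Changing the base only rescales the constant: bits = nats / ln 2.
bits = h(x2, base=2)
nats = h(x2)

print(additive_lhs, additive_rhs)
print(power_lhs, power_rhs)
print(h_one)
print(bits, nats / math.log(2))
```

The same information expressed in bits and in nats differs only by the constant factor $\ln 2$, which is the sense in which the constant is partially absorbed into the base.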

The closer to 1 the likelihood is, the less information there is about the statement in the experience. This makes intuitive sense, since likelihoods close to 1 hardly change prior belief. Likelihoods less than 1, which decrease the prior, have negative logarithms, and those greater than 1 have positive logarithms. Hence the sign of the constant of integration depends on properly analyzing the meaning of increasing and decreasing belief. Consider then writing the likelihood in terms of the posterior and prior:

$$I_s(e) = C\log\frac{p(s|e\land H)}{p(s|H)} = C\log p(s|e\land H) - C\log p(s|H)$$

This is interesting. The information contained in the likelihood can be written as a difference between the logarithm of the posterior and prior. We define now self-information $i(s|H) = C\log p(s|H)$. The information contained in an experience is the difference between posterior self-information and prior self-information. It is natural to assume that self-information must be non-negative, which pins down the sign of the constant of integration to being negative (since probabilities are bounded from above by 1):

$$i(s|H) = -|C|\log p(s|H), \qquad I_s(e) = i(s|e\land H) - i(s|H) = -|C|\log\frac{p(s|e\land H)}{p(s|H)}$$

This definition captures the essential property that information in an experience about a statement is related to how much the experience changes an EA’s belief in that statement.  Though it seems we are at the end of our journey in deriving an analytical form for information, it turns out that we are not. In the next post we’ll examine the issues with the above form, and use the consistency of the Realm of Discourse to deal with them and patch things up.
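For a concrete feel of the numbers, here is a minimal sketch that measures self-information in bits, taking the constant to be negative with unit magnitude in base 2. The function names and the belief values are illustrative assumptions.

```python
import math

def self_information(p):
    """Self-information in bits: i = -log2(p), non-negative for 0 < p <= 1."""
    return -math.log2(p)

def information(prior, posterior):
    """Posterior self-information minus prior self-information, in bits."""
    return self_information(posterior) - self_information(prior)

# Belief halves from 0.5 to 0.25: the likelihood is 0.5.
halved = information(0.5, 0.25)

# Belief is untouched: the likelihood is 1.
unchanged = information(0.5, 0.5)

# Belief doubles from 0.25 to 0.5: the likelihood is 2.
doubled = information(0.25, 0.5)

print(halved, unchanged, doubled)
```

An experience that leaves belief unchanged carries zero information, and with this sign convention an experience that raises belief in a statement comes out with the opposite sign to one that lowers it by the same factor.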

Till then!

# Part I: Belief and Probability

What is Information?

Is there information in a book? Is it in the pages and extracted when it is read? Why don’t two people get the same amount of information from the same book? For that matter, why do I feel informed differently when I read the same book a second time? Or if I read the same book but it’s written in another language? Why do I matter so much when I’m talking about the information contained in a book?

The first lesson of understanding information is that it is contextual.

It’s impossible to disentangle the two aspects of what makes up information: the thinking thing, which I will refer to as an epistemic agent (EA), and the interaction of the EA with data, which I will refer to as an experience. An EA changes, it feels informed, when it has an experience that is informative, so information must be related to experiences that change an EA. An experience is interpreted through the lens of the EA, so information must be related to the state of the EA.

The question that must then be addressed is what is it about the EA that changes during an experience? Put more concisely, what is the state of an EA?

We can look to ourselves for guidance in this matter. Whenever I have an experience that I feel is informative, it is because the experience has made me see things in a new light. I believe in certain things more strongly, and in other things less so after the experience. My beliefs about the world, whatever that may be, change when I am informed. The more moved I am by an experience, the more change occurs to my beliefs. At the very least, for the most mundane of experiences, my belief in the statement ‘I have experienced that’ increases, all else remaining the same.

If belief is the key to defining the state of an EA, then the next question is what are these beliefs about? I have already mentioned belief in a particular statement, so let me expand on that in a rather Wittgensteinian fashion. The world is what it is, and EAs have pictures of the world that they believe. It is important to discern between the world itself, and the picture of the world. The ontology of the former is important, but it is the epistemology of the latter that we must look to. The picture of the world is painted with language, and statements are the atoms through which the picture is generated. As an EA’s beliefs in statements fluctuate, so does the picture. The picture of the world that an EA has is an ever-shifting mental masterpiece, updated through experiences from the blank slate of a child to the well colored vision of an adult.

To be more precise, let $\mathcal R = (S,L)$, be a Realm of Discourse (RoD) consisting of a set of statements about the world, $S$, and logical connectives, $L:S^n\rightarrow S$. The latter are n-ary maps from multiple statements to statements. The logical connectives imply that there is a subset of $S$ known as atomic statements, $S_0 \subset S$, which form the most basic things that can be said about the world within the RoD. The remaining statements are called compound statements, for obvious reasons. Some examples of logical connectives, for all $s,s'\in S$, are the unary function of negation, $\neg(s) = \neg s$, and the binary functions of conjunction and disjunction, $\land(s,s')=s\land s'$ and $\lor(s,s')=s\lor s'$. Logical connectives are the rules by which the RoD is constructed, generating all possible statements that could be made about the world by EAs. The key thing to get from the preceding is that an RoD establishes all possible pictures of the world that can be reached, and hence puts limits on what can be said. What is outside of an RoD, an EA cannot speak of.

With all possible pictures of the world in hand, what determines the particular picture of the world that an EA has, and hence what is the state of the EA? We are now back to thinking about belief, and are almost in the position to define it quantitatively. From the RoD, a picture of the world is the degree to which an EA believes in every possible statement. “The sky is blue” and “vaccines cause autism” are examples of statements that an EA has some belief in. Note that we are NEVER talking about the truth value of any particular statement, only the degree to which an EA believes it. The epistemology is ontologically neutral. What the world is is less important than the picture of the world that the EA has. It is the picture of the world that we must examine, to understand how it changes due to experience. Furthermore, the beliefs of an EA are history dependent, in that they are what they are due to the experiences that an EA has had. More on this in a bit.

An interesting aspect of belief, which I alluded to by describing it as a degree, is that it is transitive: if I believe that I am a human being more than that I am an animal, and I believe that I am an animal more than that I am a rock, then necessarily I believe more so that I am a human being over that I am a rock. This transitivity of belief is incredibly powerful, because it means that beliefs are both comparable and ordered. I can take any two beliefs I have and compare them, and then I can say which one I believe in more. It may be difficult to compare two beliefs that are very similar, but upon close inspection it appears all but impossible to find a pair that are fundamentally incomparable. These two properties of belief form the first of Cox’s axioms, whose work has heavily influenced my thinking.

The ordering of belief has a wonderful consequence: we can model the degree to which an EA believes in statements from the RoD by real numbers. A picture of the world is a mapping from the RoD to the continuum, assigning to each statement in the RoD a real number. Recalling the history dependence on past experience, we denote the real number that describes the belief in statement $s\in S$ by $b(s|H)$, where $H$ describes the history of the EA: the set of all past experiences relevant to the statement $s$.

Where do the logical connectives of the RoD come into play? This brings us back to the difference between atomic and compound statements. A picture of the world is rational if the belief function on the RoD obeys the second and third of Cox’s axioms: Common Sense and Consistency. Common sense reduces the space of possible relationships between beliefs in atomic and compound statements. Consistency results in a pragmatic rescaling of the RoD map, rendering the common sense relationships in a simpler form. These rescalings are referred to as regraduations of belief, and the end result will look very familiar.

Moving forward, let’s examine what we mean by common sense. Imagine a unary connective acting on a single statement, keeping all else fixed. Common sense dictates that the belief in the transformed statement should somehow be related to the EA’s belief in the original statement:

$$b(\neg s|H) = f(b(s|H))$$

Similarly for binary connectives, belief in a compound statement should be related to the belief in the statements separately, and dependently on one another:

$$b(s\land s'|H) = g(b(s|H),\,b(s'|H),\,b(s|s',H),\,b(s'|s,H))$$

One could go ahead towards arbitrary n-ary connectives, but writing down these relationships becomes quite unwieldy. Fortunately all we need is contained in the relationships $f:\mathrm R\rightarrow\mathrm R$ and $g:\mathrm R^4\rightarrow\mathrm R$. In fact, we can simplify things even further by restricting ourselves to the particular RoD that has negation and conjunction connectives. In this RoD all other logical connectives can be written as combinations of these two (for example, disjunction can be expressed as $s\lor s' = \neg (\neg s \land \neg s')$; see propositional logic for more). In this RoD the common sense relationship for binary operators is trivial for many combinations of its arguments, and reduces the domain to a 2-dimensional subspace of the original domain. There is still a freedom to choose which two arguments. Given the commutativity of conjunction, either pair $(b(s|H),b(s'|s,H))$ or $(b(s'|H),b(s|s',H))$ leads to the same results, so we choose the former for simplicity, writing $b(s\land s'|H) = g(b(s|H),b(s'|s,H))$.
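That negation and conjunction suffice to express disjunction can be confirmed by brute force over truth values. The sketch below is illustrative only; the function names `neg`, `conj`, and `disj` are chosen here and work on truth values rather than on beliefs.

```python
from itertools import product

def neg(a):
    """Negation of a truth value."""
    return not a

def conj(a, b):
    """Conjunction of two truth values."""
    return a and b

def disj(a, b):
    """Disjunction built only from negation and conjunction."""
    return neg(conj(neg(a), neg(b)))

# Enumerate every assignment of truth values to the two statements.
table = [(a, b, disj(a, b)) for a, b in product([False, True], repeat=2)]

for a, b, d in table:
    print(a, b, d, d == (a or b))
```

Every row agrees with the built-in `or`, so nothing is lost by keeping only the two connectives.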

The application of common sense has led us to necessary relationships between the belief function which generates the EA’s picture of the world, and the logical connectives the RoD is equipped with which limit what the EA can speak of. One could have, of course, demanded more complicated relationships, or none at all, but then one would have to argue why my belief in “it is raining” is not at all related to my belief in “it is not raining”. Common sense is epistemically satisfying, and, as will be shown, incredibly powerful in its restrictiveness on the possible forms of belief.

Consistency is the final ingredient that must be incorporated into the quantification of belief. It demands that if there are multiple ways of constructing a statement in the RoD, then the common sense relationships should act in such a way that belief in the statement does not depend on the path used to get to it. For example I could decompose the compound statement $s\land s'\land s''$ as either $s\land (s'\land s'')$ or $(s\land s')\land s''$. Consistency requires that associatively equivalent decompositions lead to the same belief value. This alone is quite powerful; consider the implications for $g$ due to it. Denote $x = b(s|H)$, $y = b(s'|s,H)$, and $z=b(s''|s,s',H)$. The above mentioned associativity implies:

$$g(g(x,y),z) = g(x,g(y,z))$$

This is a functional equation, and it is a beast. Functional equations make differential equations look like adorable little cupcakes. Functional equations are beautiful. Solving them, if possible, requires far more dexterity at analysis than attacking other types of equations, and this is but the first functional equation that we will find on our journey towards an epistemic understanding of information.

The solution to the associativity functional equation requires showing the existence of a function $\psi$ that satisfies:

$$\psi(g(x,y)) = \psi(x)\,\psi(y)$$

The existence proof is long and tedious, but checking that the associative functional equation is satisfied is straightforward. Once one is convinced of this, we can regraduate (rescale) beliefs with $\psi$, $b \rightarrow \psi \circ b$. Why would we do this? The only reasonable explanation is pragmatism. This regraduation transforms the relationship for conjunctions into simple multiplication, yielding:

$$b(s\land s'|H) = b(s|H)\,b(s'|s,H)$$

If you just got a tingly sensation, you’re not alone…
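The straightforward check mentioned above can be carried out numerically for one particular choice of $\psi$. Everything here is an illustrative assumption: the odds function is picked only because it is easy to invert, and any invertible $\psi$ would serve equally well.

```python
import math

def psi(x):
    """Hypothetical regraduation function: the odds x / (1 - x) on (0, 1)."""
    return x / (1.0 - x)

def psi_inv(t):
    """Inverse of psi, mapping positive reals back to (0, 1)."""
    return t / (1.0 + t)

def g(x, y):
    """Conjunction rule built from psi: g(x, y) = psi^{-1}(psi(x) * psi(y))."""
    return psi_inv(psi(x) * psi(y))

x, y, z = 0.3, 0.6, 0.8  # arbitrary unregraduated belief values

left = g(g(x, y), z)    # (s and s') and s''
right = g(x, g(y, z))   # s and (s' and s'')

# After regraduation the rule is plain multiplication.
regraduated = psi(g(x, y))
product_rule = psi(x) * psi(y)

print(left, right)
print(regraduated, product_rule)
```

The two groupings agree, and applying `psi` to the combined belief reproduces the product of the regraduated beliefs.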

Recalling that $\neg\neg s=s$, consistency produces a second functional equation from the first common sense relationship on unary operators:

$$f(f(x)) = x$$

This equation is sometimes referred to as Babbage’s equation, and has a long history. One can verify quickly that for any invertible function $\phi$, a solution to this equation is of the form:

$$f(x) = \phi^{-1}(1-\phi(x))$$

Regraduating beliefs once more with this function, $b\rightarrow \phi\circ b$, the unary relationship becomes:

$$b(\neg s|H) = 1 - b(s|H)$$

Let’s discuss these tingly sensations that we’re feeling in a moment. It’s a good idea to note that we have just derived that the belief in a negation is a monotonically decreasing function of the original belief. This means that the more an EA believes in a statement, the less they believe in the negation of the statement. Common sense, right?
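The negation rule can be illustrated with a concrete choice of $\phi$. This is a sketch under assumptions: $\phi(x) = x^2$ on $[0,1]$ is picked purely for convenience, and the function names are hypothetical.

```python
import math

def phi(x):
    """Hypothetical invertible regraduation function on [0, 1]."""
    return x ** 2

def phi_inv(t):
    """Inverse of phi on [0, 1]."""
    return math.sqrt(t)

def f(x):
    """Negation rule before regraduation: f(x) = phi^{-1}(1 - phi(x))."""
    return phi_inv(1.0 - phi(x))

xs = [0.1, 0.3, 0.5, 0.9]  # arbitrary belief values, in increasing order

# Negating twice returns the original belief.
double_negation = [f(f(x)) for x in xs]

# Belief in the negation falls as belief in the statement rises.
negated = [f(x) for x in xs]

# After regraduating with phi, the rule reads phi(f(x)) = 1 - phi(x).
regraduated = [(phi(f(x)), 1.0 - phi(x)) for x in xs]

print(double_negation)
print(negated)
print(regraduated)
```

The list `negated` is strictly decreasing, which is the monotonicity noted above, and the regraduated pairs agree to floating-point precision.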

One final consideration should be made for the bounds of the regraduated belief function. To find bounds we should consider statements that are purposefully complicated such as $s=s\land s$. We denote maximal belief by $b_T$, and apply the multiplication rule, $b(s|H) = b(s\land s|H) = b(s|H)b(s|s,H)=b(s|H)b_T$. Since an EA’s belief in a statement that is part of their history is maximal, this implies that $b_T =1$. Furthermore, the negation of a maximal belief is minimal due to the monotonicity of the negation relationship. Denoting minimal belief as $b_F$ one can quickly show that $b_F = 0$. We have then that beliefs are real numbers in the interval $[0,1]$.

The astute reader will notice that the regraduated belief function satisfies two rules which are identical to those found in probability theory: Bayes’ Rule and the Normalization Condition:

$$b(s\land s'|H) = b(s|H)\,b(s'|s,H), \qquad b(s|H) + b(\neg s|H) = 1$$

Seeing this coincidence prompted Cox to wonder, as many have, on the nature of probability. The predominant view of what probability is stems from the frequentist school of thought. Probabilities are frequencies. When I say that a coin has a probability of landing heads of .5, one usually explains this by discussing an experiment. A coin is tossed many times (or an ensemble of identical coins are all tossed at once) and the number of heads is counted and divided by the total number of throws. The resulting number is not necessarily equal to .5, but it’s probably close, and the frequentist will then tell you that if you just performed the experiment an infinite number of times then the frequency of heads would approach one half.

That’s cute, but then a planetologist tells you that the probability of life on Mars is 42%. Do you envision an experiment where a multitude of Universes are created and the planetologist counts the number of them that have a Mars with life in them, and then divides by the total number of Universes that were created? Let’s not forget they have to keep making Universes forever if we want to apply the frequentist definition of probability. Clearly this interpretation cannot handle such a statement.

An opposing school of thought on the matter is that of Bayes. The Bayesian interpretation of probability eschews the use of frequencies, and casts probabilities as parts of the state of knowledge. Here at last we see the connection between the epistemological theory of belief developed by Cox (read his original 1946 paper Probability, Frequency, and Reasonable Expectation) and the Bayesian school of thought. Since belief obeys the same rules as probability, it is no leap to conjecture that $p=b$.

Probabilities ARE beliefs.

When the planetologist tells you that there is a 42% probability of life on Mars, they are telling you, based on their past experiences (education, research, exposure to popular culture, etc.), how strongly they believe life exists on Mars. Their background is suited for them to have a well thought out belief, and so their statement has weight. Meteorologists do the exact same thing with the weather, which is why sometimes you’ll notice that different weather sites have slightly different probabilities for future weather patterns. These differences reflect the differences in belief that the meteorologists (or I should say the models they’re using to analyze meteorological data) have in predicting the future based on the data that they have interacted with.

What has been done here is an exposition on the epistemic foundations of the Bayesian interpretation of probability theory. By grounding the interpretation of probability in an epistemic theory, we can now move forward with our main investigation of what information is. The tools that have been developed will help us in this journey, since now we see how an epistemic agent has an internal state that is defined by a belief function on the Realm of Discourse. This belief function creates a painting of the world which influences how the EA behaves when novel experiences are had. Beliefs can be analyzed via the rules of probability, and, in particular, how they change will lead us to an answer to the question What is Information?