# 1-d Ising Model: Transfer Matrix Method

In this post we will develop the transfer matrix method for solving the one dimensional Ising model. Hopefully the post will be pedagogical enough so that beginners can use it as a launching point into their investigations of more interesting systems.

We begin with a one-dimensional grid with $N$ vertices. The state of the $i^{th}$ vertex is denoted $\sigma_i\in \{1,-1\}$. We impose periodic boundary conditions so that $\sigma_{N+1} = \sigma_1$. The Hamiltonian can be written in a symmetric form:

$$H = -\sum_{i=1}^{N}\left[J\sigma_i\sigma_{i+1} + \frac{h}{2}\left(\sigma_i + \sigma_{i+1}\right)\right]$$

The components of the transfer matrix are defined as

$$T_{\sigma_i\sigma_{i+1}} = e^{\beta\left[J\sigma_i\sigma_{i+1} + \frac{h}{2}\left(\sigma_i + \sigma_{i+1}\right)\right]}$$

so that the transfer matrix reads

$$T = \begin{pmatrix} e^{\beta(J+h)} & e^{-\beta J} \\ e^{-\beta J} & e^{\beta(J-h)} \end{pmatrix}$$

Some of you may now be thinking “why in the world did we decide to do that?” I apologize for the preemptive definition without any motivation, but bear with me. All we need to do is look at the partition function for this system to see the utility of the transfer matrix:

$$Z = \sum_{\{\sigma\}} e^{-\beta H} = \sum_{\{\sigma\}} \prod_{i=1}^{N} T_{\sigma_i\sigma_{i+1}} = \text{Tr}\, T^N$$

Voila. The partition function can be written as the trace of the transfer matrix exponentiated to the power of the number of vertices. Why is recasting the partition function in the language of linear algebra important here? Well, because we can now use our arsenal of tricks to simplify the trace. If we can diagonalize the transfer matrix, $T = S^{-1}\Lambda S$, then, using the cyclic property of the trace, we can write:

$$Z = \text{Tr}\left(S^{-1}\Lambda S\right)^N = \text{Tr}\,\Lambda^N = \lambda_1^N + \lambda_2^N$$

where the $\lambda_i$ are the eigenvalues of the transfer matrix, ordered so that $\lambda_1 \ge \lambda_2$. The eigenvalues are the solutions to the characteristic equation, which for a $2\times 2$ matrix reads:

$$\text{Det}\left(T - \lambda I\right) = \lambda^2 - \lambda\,\text{Tr}\,T + \text{Det}\,T = 0$$

where $\text{Det}$ is the determinant. The two solutions are

$$\lambda_\pm = e^{\beta J}\left[\cosh\beta h \pm \sqrt{\sinh^2\beta h + e^{-4\beta J}}\right]$$

Here’s a visualization of both of them. Clearly, the eigenvalue with the positive sign is greater than the other for all values of the coupling and the magnetic field (can you prove this?). Hence it plays the role of $\lambda_1=\lambda_+$. In the thermodynamic limit, we then have:

$$Z = \lambda_+^N\left[1 + \left(\frac{\lambda_-}{\lambda_+}\right)^N\right] \xrightarrow{N\to\infty} \lambda_+^N$$

I can’t stress enough how nice it is that we can derive an expression for the partition function here. For the majority of problems, calculating the partition function is not possible, so this example really is a very special case from which we can gain a lot of insight. We can instantly read off thermodynamic quantities by taking derivatives. In particular we find that the mean magnetization is:

$$m = \frac{1}{N\beta}\frac{\partial \ln Z}{\partial h} = \frac{\sinh\beta h}{\sqrt{\sinh^2\beta h + e^{-4\beta J}}}$$

Since this drops to 0 when the external field vanishes, we see that there is no spontaneous magnetization in the 1 dimensional Ising model, a fact originally discovered by Ising himself. Unfortunately, Ising then went on to assume that this means that there is no phase transition in the model in any number of dimensions, which is not true. You can read more about the history of how the critical temperature of the phase transition in 2-dimensions was discovered over the course of the three decades that followed Ising’s analysis of the one dimensional model here.
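If you’d like to check the algebra yourself, here’s a quick numerical sanity check in Python (the parameter values are arbitrary, not special in any way): it compares the eigenvalue form of the partition function against a brute-force sum over all $2^N$ states, and confirms that the magnetization vanishes as $h\to 0$.

```python
import math
from itertools import product

# Illustrative parameter values (assumptions, not fixed by the model)
J, h, beta, N = 1.0, 0.5, 0.7, 8

# Transfer-matrix eigenvalues: lambda_pm = e^{bJ} [cosh(bh) +/- sqrt(sinh^2(bh) + e^{-4bJ})]
def eigenvalues(h):
    root = math.sqrt(math.sinh(beta * h) ** 2 + math.exp(-4 * beta * J))
    scale = math.exp(beta * J)
    return scale * (math.cosh(beta * h) + root), scale * (math.cosh(beta * h) - root)

lam_p, lam_m = eigenvalues(h)
Z_transfer = lam_p ** N + lam_m ** N

# Brute-force partition function: sum e^{-beta H} over all 2^N spin states,
# with periodic boundary conditions
Z_brute = 0.0
for s in product([1, -1], repeat=N):
    H = -sum(J * s[i] * s[(i + 1) % N] + h * s[i] for i in range(N))
    Z_brute += math.exp(-beta * H)

print(Z_transfer, Z_brute)  # the two agree to floating-point precision

# Thermodynamic-limit magnetization: m = sinh(bh) / sqrt(sinh^2(bh) + e^{-4bJ})
def m_exact(h):
    return math.sinh(beta * h) / math.sqrt(math.sinh(beta * h) ** 2 + math.exp(-4 * beta * J))

print(m_exact(0.5), m_exact(1e-9))  # the second value is ~0: no spontaneous magnetization
```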

# Entropic Compression and English

Let’s say that we want to send a message to a friend. What’s the smallest amount of information we need to send that message? The answer is related to the entropy of the alphabet the message is written in. This post will cover exactly how the two are connected.

First some definitions. A letter is a symbol, denoted $a_i$, used to write messages. The subscript runs over all the letters in the alphabet being used, $\mathcal A = \{a_i\}_{i=1}^N$, so the alphabet has $N$ letters. A message is a string of letters, $\mathcal M = a_{i_1}a_{i_2}\cdots a_{i_M}$, and its size is $M$ letters. To convey the message we need to store each letter and send it, so the way in which we store letters determines the amount of storage we need to send the message.

Let’s first think through what we mean by storage. Since the alphabet has $N$ letters, each letter can be stored in either a single N-it, or in $\log_2 N$ bits. Think of the latter as the number of yes/no questions one needs to answer to pinpoint which element of the alphabet a particular letter is (if the logarithm is fractional we round up). Hence the entire message needs $M \log_2 N$ bits to be sent. Is this the best we can do? Obviously not, otherwise this post wouldn’t exist.

The key to compressing the message is to use the entropy of the alphabet. The lower the entropy of the alphabet, the more compression we can achieve. The idea is to store the letters of the alphabet so that the most common letters use up less storage (fewer bits), at the expense of rare letters using more space. This means that certain messages, when stored under this new scheme, will actually use up MORE storage. However, such messages are rare, and, on average, messages will take up less storage.

So, how is this new storage scheme accomplished?

First we need to find the probability of appearance of each letter, $p_i = p(a_i)$. This can be done by looking at a corpus of messages that have already been sent to see which letters are used more often than others, or through the use of an epistemic agent that starts off ignorant and observes messages, refining its belief distribution for the alphabet. Once the probabilities are known, the information content in each letter is:

$$I_i = -\log_2 p_i$$

The ceiling of this value is the number of bits entropic compression requires to store that particular letter. Hence, each letter can be mapped to a code word that is that many bits long. So what is the average number of bits needed to store a particular letter?

$$\langle I \rangle = \sum_{i=1}^{N} p_i I_i = -\sum_{i=1}^{N} p_i \log_2 p_i = S[\mathcal A]$$

Voila! The entropy of the alphabet coincides with the average number of bits needed to store a letter. Since the entropy is maximized for an equiprobable alphabet at $S^{max} = \log_2 N$, this scheme will use no more space than the original. To see this we consider long messages: the total information needed to transmit them will be $M S[\mathcal A] \leq M\log_2 N$. Now, there is a small caveat here, since we did use the ceiling function, but we won’t worry about that.
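As a concrete sketch, here is a small Python snippet using a made-up dyadic distribution (chosen so the ceilings are exact): it computes the code lengths $\lceil -\log_2 p_i \rceil$, checks the Kraft inequality $\sum_i 2^{-l_i}\leq 1$ (which guarantees a prefix-free code with those lengths exists), and compares the average code length with the entropy.

```python
import math

# A toy four-letter alphabet with hypothetical probabilities (not English data)
p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

# Code length for each letter: the ceiling of its information content
lengths = {a: math.ceil(-math.log2(q)) for a, q in p.items()}
print(lengths)  # {'a': 1, 'b': 2, 'c': 3, 'd': 3}

# Kraft inequality: sum of 2^{-l_i} <= 1 means a prefix-free code with these lengths exists
kraft = sum(2 ** -l for l in lengths.values())

avg_bits = sum(q * lengths[a] for a, q in p.items())
entropy = -sum(q * math.log2(q) for q in p.values())
print(kraft, avg_bits, entropy)  # for a dyadic distribution the average equals the entropy
```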

So what does this mean for messages sent in English? Assume an alphabet of only 26 letters (don’t distinguish between upper and lower case, and no spaces). The average number of bits needed to convey a letter is $\log_2 26\approx 4.70$. But we aren’t writing just any messages, we’re writing messages in English. I downloaded the 11th edition of Encyclopaedia Britannica, which gave me a dataset of over 40 million words (40,350,587, to be exact). Let’s take some time to discuss it, first starting with letters, and then moving up to clusters of letters and words.

Parsing all those words into their constituent letters gave me 192,439,482 instances of letters in the English language. How often does each letter occur? As you can see, ‘E’ is the winner, and vowels show up pretty often; ‘Q’s and ‘Z’s do not. We can calculate the entropy of English based off of this letter distribution, and find that the average information needed to store a letter is $S[\mathcal A] \approx 4.17$ bits. Hence this scheme can compress messages by about 11%. Actually implementing this scheme requires figuring out the code words to map each of the letters to. This can be accomplished quite elegantly using a Huffman tree, which also gives entropic compression its standard name: Huffman coding.
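A minimal Huffman implementation is short enough to sketch here; the sample text is just an illustration, not the Britannica data.

```python
import heapq
from collections import Counter

def huffman_code(freqs):
    """Build a prefix-free code from symbol frequencies via a standard Huffman tree."""
    # Heap entries: (weight, tiebreak_id, tree); a tree is a symbol or a (left, right) pair.
    # The unique tiebreak id keeps heapq from ever comparing trees directly.
    heap = [(w, i, sym) for i, (sym, w) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    count = len(heap)
    if count == 1:
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, count, (t1, t2)))
        count += 1
    code = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            code[tree] = prefix
    walk(heap[0][2], "")
    return code

text = "this is an example of a huffman tree"
code = huffman_code(Counter(text))
encoded = "".join(code[c] for c in text)
print(len(encoded), "bits, versus", len(text) * 8, "bits for 8-bit characters")
```

Frequent symbols end up near the root of the tree and get short code words; because every symbol sits at a leaf, no code word is a prefix of another, so the bit stream can be decoded unambiguously.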

It should also be mentioned that this compression scheme provides the maximum compression possible given the constraint imposed by the letter frequencies on which we trained the epistemic agent whose beliefs we’re using. This is because the entropy-maximizing distribution under the constraint is the most ignorant belief function allowed: lowering the entropy would imply that the epistemic agent is not maximally ignorant. But if the EA is not maximally ignorant beyond the information given to it by the constraint, then there must exist some other information (constraint) that it is using to narrow its beliefs. Since, by assumption, the frequency constraint is all the information the EA has to work with, a lower entropy implies a contradiction.

Now, this compression is great, but can we add more constraints from the English language to make it even better? What sort of constraint should we add?

If you are given a word to guess, the first few letters might take around 4-5 tries each. That’s what the entropy means: you’ll need to ask, on average, 4.17 questions to get the right letter. However, once you have the first one or two letters you’ll be able to guess the remaining letters pretty quickly. If you see a ‘Q’ you’re going to guess ‘U’ is the next letter, and you’ll most likely be right. That’s because English has much more structure than just the frequency of letters: some letters tend to follow other letters more often than others. This following preference can help us construct an even better compression algorithm!

Once again I took the Encyclopaedia Britannica and did an analysis of which letters follow each other. I got the frequencies for all pairs of letters in words, and then calculated the conditional probabilities via Bayes’ Rule:

$$p(a_j|a_i) = \frac{p(a_i a_j)}{p(a_i)} = \frac{p(a_i a_j)}{\sum_{j} p(a_i a_j)}$$

The conditional probabilities are referred to as 1-transitions, since they tell you what the probability is of transitioning to a new letter given that you have 1 letter of knowledge. The results are pretty neat:

The scale is logarithmic, so the colors go in powers of 2, and each row is normalized. The difference between red and blue is 16, so red is over 64,000 ($2^{16} = 65{,}536$) times as likely as blue. Note the patterns in this plot. Consonants love to be followed by vowels. ‘B’ loves to be followed by ‘Y’ and ‘E’. ‘Q’ is almost always followed by ‘U’. There are some weird ones in there too. Why is ‘Q’ sometimes followed by ‘V’? That’s probably because the Encyclopedia’s digital version hasn’t been fully edited: the original was scanned and turned into letters based on the scanned shapes, and older books (this one is from 1911) had u’s printed as v’s.

Neat, so now we have transition probabilities. How picky are the transitions for different letters? That can be answered by looking at the entropy of each of these distributions:

$$S[\mathcal A|a_i] = -\sum_{j} p(a_j|a_i)\log_2 p(a_j|a_i)$$

The dashed green line is the entropy of the alphabet. Notice that all the transition entropies are lower, meaning that if you know a letter, it takes fewer guesses to find the next letter. For letters on the right side of the plot, far fewer guesses are needed. The letters on the left side are less picky about their followers, and more guesses are needed. We can perform a Huffman encoding on each of these distributions, and further compress our English messages. As long as the decoder has this list of lists of code words, they will be able to decipher what we send them.

So, how much have we gained by looking at next letters?

The answer has to do with the conditional entropy of the joint distribution of two letters. Put plainly, the entropy of two letters minus the entropy of the first letter gives us the average number of questions we need to ask to guess the second letter if we know the first (translation: how many questions you need to guess the pair of letters, minus how many questions you need to guess the first letter):

$$S[\mathcal A|\mathcal A] = S[\mathcal A\times\mathcal A] - S[\mathcal A]$$

Calculating this entropy we get $S[\mathcal A|\mathcal A] \approx 3.29$, or a 30% compression from the original. We’ve almost doubled our compression by considering which letters are more likely to follow which letters.
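Here is a toy version of that computation in Python; the six-word “corpus” is made up for illustration and is obviously nothing like the Britannica data.

```python
import math
from collections import Counter

words = ["the", "quick", "queen", "quiet", "then", "that"]  # a hypothetical toy corpus

# Counts of adjacent letter pairs inside words, and of the first letter of each pair
pairs = Counter(w[i:i + 2] for w in words for i in range(len(w) - 1))
firsts = Counter()
for pair, n in pairs.items():
    firsts[pair[0]] += n

def entropy(counter):
    total = sum(counter.values())
    return -sum(n / total * math.log2(n / total) for n in counter.values())

# Conditional entropy of the next letter given the current one:
# S[A|A] = S[A x A] - S[A]
S_cond = entropy(pairs) - entropy(firsts)
print("S[A|A] =", S_cond, "bits")
```

Even on this tiny corpus the conditional entropy is far below $\log_2 26$, because pairs like ‘qu’ and ‘th’ dominate.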

But why stop there?

Why not look at the probability that a letter follows a pair of letters: the 2-transition probabilities? Or the 3-transition probabilities, etc. Doing this for the data set, I found that English has  $S[\mathcal A|\mathcal A\times\mathcal A] \approx 2.82$, $S[\mathcal A|\mathcal A\times\mathcal A\times\mathcal A] \approx 2.03$, $S[\mathcal A|\mathcal A\times\mathcal A\times\mathcal A\times\mathcal A] \approx 1.19$. If we plot the entropies based on the number of constraints we’re using, we find something amazing:

The line at an entropy of 1 corresponds to one wrong guess, on average, to get the next letter right. We clearly see a linear trend in the entropies approaching zero. If the trend continues, we expect perfect guessing of the next letter somewhere around 6 priors. So what does this mean about compression? Well, it means we need to spend the most bits (on average) on the first letter of each word in a message, and fewer bits on each consecutive letter in that word. We shouldn’t even send the letters beyond the 7th in long words because they’ll be easily filled in by the decoder (most of the time). Recall that with Huffman encoding we were able to reduce the size of a word in a message to $MH \approx M\times 4.17$ bits. What is the average storage we need now?

To answer that we really need the word distribution of English: How many words of each length appear in the language? There’s a nuance here, in that we are interested in the usage of the language, and not the corpus of words that make up the language. The latter is over all possible unique words, whereas the former includes repetition. We can plot both distributions:

This is a cool diagram because it tells us that the distribution of used words pulls the average letter count per word down quite a bit. Treating words as unique entities, the mean word length of English is 8.37 letters, and 50% of words have fewer than 7.46 letters. Treating words as non-unique elements of messages, the mean word length is 4.78 letters, and 50% of words have fewer than 3.55 letters. The distributions are equal at 5.19 letters per word, meaning that English words with more than 5 letters are underused, while words with 5 or fewer letters are overused.

Since we’re dealing with how English is being used to make messages, we should use the blue distribution to calculate the average entropy of a word. We do this by averaging the entropy needed to store a word of length $l$ over all possible word lengths:

$$S_{\text{word}} = \sum_{l} p(l) \sum_{k=1}^{l} S_k$$

where $S_k$ is the conditional entropy of the $k^{th}$ letter given the letters before it.

The result is pretty amazing.

How amazing?

Consider the original scheme of storing each letter of the alphabet using the same amount of storage. The average word entropy would be about 22.42 bits. With entropic encoding, the entropy we have as a result of the above is 11.44 bits. We’ve achieved a compression of about 50%, which basically allows us to send twice as many messages as before. Pretty neat, huh?

# The Ising Model: A Graph Theoretic Introduction

In this post I will introduce the Ising model from a graph theoretic point of view, without resorting to a particular graph structure on which the system lives. I will explore more specific systems in a future post.

Consider an undirected graph $G=\langle V,E \rangle$ where $V$ is a set of vertices, and $E$ is a set of edges. A state of the graph, $\sigma_G$, is a mapping of the vertices, $\sigma: V\rightarrow \{-1,1\}: v_i \mapsto \sigma_i$, and the edges, $\sigma: E\rightarrow \{-1,1\}: e_{ij}\mapsto \sigma_{ij} = \sigma_i\sigma_j$, where i and j are the vertices that bound the edge. This definition makes edges that bound vertices of the same state have a state of +1, and edges that bound vertices of opposite states have a state of -1.

Since each vertex can be labeled in one of two ways, and the state of the edges is fully determined by the state of the vertices, there are a total of $2^{|V|}$ possible states for the graph, where $|\cdot|$ is the cardinality operator. We bundle these together and write $\Sigma_G = \{\sigma_G\}$.

We want to compare different states to one another, and the easiest way to do this is to define a Hamiltonian. A Hamiltonian is a function that maps states to real numbers, which we call energies. It consists of a piece that acts on the edges, and a piece that acts on the vertices:

$$H[\sigma_G] = \sum_{e_{ij}\in E} f_1(\sigma_{ij}) + \sum_{v_i\in V} f_0(\sigma_i)$$

where the $f$’s are coupling functions. In the simplest case we choose $f_0(x)=-hx$, and $f_1(x) = -Jx$. In the physics parlance, $h$ is referred to as the magnetic field, and $J$ the coupling. Plugging in the definitions for the state of vertices and edges, the form of the Hamiltonian that we will be examining reads:

$$H = -J\sum_{(i,j)}\sigma_i\sigma_j - h\sum_{i}\sigma_i = -\frac{J}{2}\sum_{i}\sum_{j\in\partial i}\sigma_i\sigma_j - h\sum_{i}\sigma_i$$

where the second sum in the first expression is over unique pairs of vertices (the edges), and in the second expression over the neighbors $\partial i$ of the $i^{th}$ vertex. The factor of a half comes from counting each edge twice in the complete expression. The reason for the negative sign is that it ranks states that are more homogeneous lower, so we can say that homogeneity decreases energy.

We can make the notion of homogeneity more precise by introducing an order parameter, called the magnetization, describing the average vertex state:

$$m(\sigma_G) = \frac{1}{|V|}\sum_{v_i\in V}\sigma_i$$

Note that this is a number between $-1$ and $1$, inclusive. There are many states that have the same order parameter. So many, in fact, that it’s best to talk about the logarithm of their number, which turns out to be (for very large graphs):

$$\ln \Omega(m) \approx |V|\left[\ln 2 - \frac{1+m}{2}\ln(1+m) - \frac{1-m}{2}\ln(1-m)\right]$$

Here I used Stirling’s approximation for the factorials in the binomial coefficient. This function is peaked close to 0, especially for large graphs. The majority of the states have little or no magnetization; in fact, if we expand around 0 magnetization we see that the states are distributed as a Gaussian:

$$\Omega(m) \approx 2^{|V|}\, e^{-|V|\left(\frac{m^2}{2} + \frac{m^4}{12}\right)}$$

Here I have included the quartic term in the Gaussian, which can be neglected. To see this we plot the distribution below for graphs with 10, 100, and 1000 vertices. We see how quickly the distribution becomes sharply peaked at vanishing magnetization:
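The comparison is easy to reproduce numerically. The sketch below evaluates the exact binomial count (via log-gamma) against the Stirling expansion for a graph of 1000 vertices, and shows how strongly states away from zero magnetization are suppressed.

```python
import math

V = 1000  # number of vertices (illustrative choice)

def log_count_exact(m):
    """Natural log of the number of states with magnetization m: log C(V, V(1+m)/2)."""
    up = round(V * (1 + m) / 2)  # number of +1 vertices
    return math.lgamma(V + 1) - math.lgamma(up + 1) - math.lgamma(V - up + 1)

def log_count_gauss(m):
    """Stirling expansion with the quartic term: |V| (ln 2 - m^2/2 - m^4/12)."""
    return V * (math.log(2) - m * m / 2 - m ** 4 / 12)

for m in (0.0, 0.05, 0.1):
    print(m, log_count_exact(m), log_count_gauss(m))

# Relative number of states at m = 0.1 vs m = 0: already suppressed by roughly e^{-V m^2 / 2}
print(math.exp(log_count_exact(0.1) - log_count_exact(0.0)))
```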

What does this mean?

We imagine that the system has some sort of dynamics: that it can transition from one state to another state. Under no other constraints, the system will soon find itself in a state with little or no magnetization.

For something more interesting to occur, the dynamics of how the system transitions from one state to another must be constrained by more than simply the available volume of phase space. Here is where we finally get to the physics of the Ising model. The constraint we need to incorporate is that the energy of the system is constant, so that we can say what the probability of each state is: $p(\{\sigma\})$. We resort to a MaxEnt analysis: the distribution of states under the constraint of constant energy is the one that maximizes the Shannon entropy of the system:

$$S[p] = -\sum_{\sigma_G\in\Sigma_G} p(\sigma_G)\ln p(\sigma_G) - \alpha\left(\sum_{\sigma_G} p(\sigma_G) - 1\right) - \beta\left(\sum_{\sigma_G} p(\sigma_G)\, H[\sigma_G] - \langle E\rangle\right)$$

Two Lagrange multipliers have been introduced to enforce the normalization and energy constraints on the system. The MaxEnt method produces the belief distribution that an epistemic agent should have about the system knowing nothing more than the energy constraint. Any deviation from this distribution implies that the EA has more information at their disposal. After a little bit of calculus, and a little redefinition of the Lagrange multipliers, one can find:

$$p(\sigma_G) = \frac{e^{-\beta H[\sigma_G]}}{Z}, \qquad Z = \sum_{\sigma_G\in\Sigma_G} e^{-\beta H[\sigma_G]}$$

This is just the standard Boltzmann distribution, and $Z$ is called the partition function (it enforces the normalization constraint).
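For a graph small enough to enumerate, the whole construction can be verified directly. Here is a sketch with a hypothetical triangle graph and arbitrary parameter values: the Boltzmann weights normalize, and the resulting entropy sits below the maximally ignorant value $\ln 2^{|V|}$.

```python
import math
from itertools import product

# A tiny illustrative graph: a triangle
edges = [(0, 1), (1, 2), (0, 2)]
n, J, h, beta = 3, 1.0, 0.3, 0.5  # arbitrary illustrative values

def H(spins):
    """Hamiltonian H = -J sum over edges - h sum over vertices."""
    return -J * sum(spins[i] * spins[j] for i, j in edges) - h * sum(spins)

states = list(product([1, -1], repeat=n))
Z = sum(math.exp(-beta * H(s)) for s in states)
p = {s: math.exp(-beta * H(s)) / Z for s in states}

print(sum(p.values()))  # normalization: 1
mean_E = sum(p[s] * H(s) for s in states)
entropy = -sum(q * math.log(q) for q in p.values())
print("mean energy:", mean_E, " entropy (nats):", entropy)
```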

I do want to discuss a bit the assumptions that went into this derivation. The actual energy of our system is the Hamiltonian, but since an epistemic agent cannot know the actual state of the system, it must resort to averaging over all possible states to define the energy. If the actual system fluctuates a bit, this means that the actual energy is a time average of the energies the system fluctuates through. The statement that this temporal average is equivalent to the average over the ensemble of all possible states is known as the ergodic hypothesis. This is only possible because we assume the energy is constant (in the temporal average), which means that the system is in equilibrium. If the system were out of equilibrium, the ergodic hypothesis would no longer hold, and the epistemic agent would not have the belief distribution about the system derived above. Out-of-equilibrium dynamics are much more difficult to analyze, and we will not do so here.

The partition function turns out to be the most important quantity to analyze. Unfortunately it also happens to be notoriously difficult to calculate. Recalling the definition of the magnetization, we can rewrite it as:

$$Z = \sum_{m} e^{\beta h|V|m} \sum_{\sigma_G\in\Sigma_m} e^{\beta J\sum_{(i,j)}\sigma_i\sigma_j}$$

where $\Sigma_m$ is the set of states with magnetization $m$.

This functional form is a sum over all possible magnetizations, followed by a sum over all states that have that magnetization. One thing we can instantly see from this is that the magnetic field, h, is acting as a bias. It suppresses belief in states that do not have a magnetization that is the same sign as the magnetic field. This is the reason why these quantities have the names they do: the magnetic field magnetizes the system.

With this machinery at hand we can now introduce dynamics. We initialize our system in any state we wish and then allow for transitions to new states: we let the system evolve in time. We use the probability of a graph state to find the probability that a single vertex transitions, by simply holding all other vertices constant:

$$p(\sigma_i \to -\sigma_i) = \frac{e^{-\beta H[\sigma_G^{(i)}]}}{e^{-\beta H[\sigma_G]} + e^{-\beta H[\sigma_G^{(i)}]}} = \frac{1}{1 + e^{\beta\Delta H_i}}$$

where $\sigma_G^{(i)}$ is the state with vertex $i$ flipped, and $\Delta H_i = H[\sigma_G^{(i)}] - H[\sigma_G]$.

At each timestep we pick some fraction of the vertices and calculate the probability that they change their state. With that probability we allow those vertices to either change or stay the same.
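A minimal sketch of that update loop in Python, using heat-bath (Glauber) flip probabilities on a small ring graph; the graph, parameter values, and random seed are illustrative assumptions, not the scale-free networks from my actual simulations.

```python
import math
import random

random.seed(0)

# Ising dynamics on an arbitrary graph, here a ring of 20 vertices
n, J, h, beta = 20, 1.0, 0.0, 1.5
neighbors = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
spins = [random.choice([1, -1]) for _ in range(n)]

def flip_probability(i):
    # Energy change from flipping vertex i, then the heat-bath rule p = 1 / (1 + e^{beta dE})
    dE = 2 * spins[i] * (J * sum(spins[j] for j in neighbors[i]) + h)
    return 1 / (1 + math.exp(beta * dE))

for step in range(2000):
    i = random.randrange(n)  # pick a random vertex each timestep
    if random.random() < flip_probability(i):
        spins[i] = -spins[i]

m = sum(spins) / n
print("magnetization:", m)
```

At this fairly strong coupling the ring tends to order into large aligned domains, so the final magnetization is typically far from zero even with $h = 0$.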

I’ve been running a lot of these types of simulations on scale-free networks. Many real-world networks, including the Internet and social networks, have been shown to be scale-free. I will write an upcoming post on how to actually create these types of networks, amongst others, but for this post I wanted to focus on the Ising model on graphs. Here’s a simulation of all that I’ve been describing:

# Truthfulness and Perception in the 2016 Primaries

Recently I came across an article on Blue Nation Review claiming that Hillary Clinton is the most truthful of the candidates, based on statistics from Politifact’s Truth-O-Meter. Sure, she has the largest percentage of True statements, but a measure of truthfulness should draw on the full distribution of statements a candidate has made, not just a particular subset. Clinton has also spoken partial truths, half truths, falsehoods, and racked up two cases of “Liar, liar, pants on fire!” What other information is lurking in the full distributions that is lost in BNR’s simplistic analysis? Being enamored of information theory and epistemology, I decided to look at Clinton, Cruz, Kasich, Sanders, and Trump’s Truth-O-Meters a little closer.

To understand this analysis, imagine an idealized thinking being, an epistemic agent (EA), which is initially completely ignorant of the truthfulness of candidates: the EA has no preference for believing what a candidate says when they say it. The EA is then exposed to statements by the candidates, and uses its fact-checking facility (the journalists at Politifact) to inform it of the truthfulness of those statements. As it learns more, its preferences shift. Analyzing more statements moves the EA from its tabula rasa starting state to a biased one, where there is a preference to believe or not believe a candidate. The more statements the EA hears, the less the belief function changes, so that after enough statements the EA has a pretty definite belief in the truthfulness of the candidates. We begin, then, with an ignorant EA, which opens up the Politifact site and begins reading the analyzed statements made by the candidates in the 2016 Primary election.

The number of statements analyzed for each candidate varies (above). Kasich and Sanders have been the least scrutinized (61 and 75 statements, respectively), while Hillary has been the most scrutinized (174 statements) by the site. Because of the differing amounts of data going into informing the EA, its belief in Clinton’s truthfulness will be the least malleable, while in the case of Kasich it will be the most malleable. A new datum from Clinton will not sway the EA’s opinion of her as much as one would for Kasich or Sanders. Cruz and Trump are in between these extremes. We call the EA’s initial beliefs about the candidates priors, and the beliefs after assimilating all the data on the candidates’ Truth-O-Meters the posteriors. You can read more about how epistemic states of knowledge change because of experiences here and here.

The posterior belief coming from an uninformative prior matches the normalized frequencies. Truth-O-Meter uses a six-point scale, but in common speech truthfulness is a continuous variable due to its dependence not just on facts, but on context, intensity of language, etc. The closer to 0, the more false a statement is; the closer to 1, the more true it is (recall that these terms are always with respect to the fact-checking facility of the EA). Interpolation to the continuum belief distribution was accomplished using a cubic Hermite spline, as described here. This choice does an excellent job of preserving the shape of the discrete frequency distribution. Below we compare the discrete frequencies with the credence/belief distributions of the EA. The area under a curve represents the probability that a statement will have a particular range of truthfulness; the total area under any of these curves is unity.

These curves are interesting because they show what the beliefs of an epistemic agent look like under the assumption that it is judging the candidates solely on what they have said, and NOTHING else. Of course real people are not EAs, and past experiences end up coloring the judgements we make concerning new experiences. Nonetheless, an EA’s credence function might be a good indicator of what someone who is not very informed about the history of the candidates ends up thinking about them after reading through all the statements analyzed by Politifact.

What value of truthfulness does an EA associate with half the things that a candidate says? The answer is found by looking at the cumulative distribution functions (above). CDFs tell us what fraction of the statements that come out of a candidate’s mouth fall below some value of truthfulness. We observe that the candidates cluster into groups: Clinton, Sanders, and Kasich lie in one, while Cruz and Trump occupy others. The truthfulness value at which the EA believes that half of a candidate’s statements are more truthful, and half less truthful, is the median.

Thus far, it is safe to say that both Trump and Cruz do not inspire much faith in the epistemic agent. The other three candidates all seem to be on equal footing, which means we need a better way to distinguish what the EA actually believes about them.

A natural next step would be to look at the moments of the credence distributions; however, given the finite domain and the presence of significant tails, standard interpretations of deviation, skewness, kurtosis, etc. would fail. Fortunately, rather than applying these ad hoc statistical measures, we can get a good grasp of the shape of the distributions by examining their Shannon entropy:

$$S[p] = -\int_0^1 p(\tau)\log p(\tau)\, d\tau$$

Shannon entropy is an informational measure that quantifies the amount of uncertainty in a distribution. Credence distributions with low entropy contain a lot of information in them, meaning that an EA is less uncertain about the truthfulness of future statements. Our EA started off with beliefs that had maximal entropy: those based on ignorance. By interacting with the candidates, the EA has acquired information that has biased its beliefs so that it can make more informed judgements. Another way to say this is that the negentropy (the difference between maximal and actual entropy) is describing how an EA views the consistency of a candidate with regards to truthfulness.

There is a problem with this line of thinking, however: it only applies to median truthfulness values that are close to one half. The reason is that in order to get a high (or low) median truthfulness, the distribution must necessarily be skewed, and hence have a lower entropy. One can estimate the maximum possible entropy of a distribution with median truthfulness $\tau$ by considering a piecewise-constant distribution. The resulting minimal negentropy can then be written in a closed analytical form: $-\log 2\sqrt{\tau(1-\tau)}$. The difference between the actual negentropy and this minimal value then gives a good measure of the consistency of a candidate in the eyes of the EA.
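The closed form is easy to verify numerically. This sketch (natural logs assumed) integrates the entropy of the piecewise-constant density with median $\tau$ directly and compares it against $-\log 2\sqrt{\tau(1-\tau)}$.

```python
import math

def min_negentropy(tau):
    """The closed form quoted above: -log(2 sqrt(tau (1 - tau)))."""
    return -math.log(2 * math.sqrt(tau * (1 - tau)))

def negentropy_piecewise(tau, bins=20000):
    """Numerically integrate -p log p for the piecewise-constant density with median tau.

    Half the probability mass sits uniformly on [0, tau], half on [tau, 1];
    the uniform density on [0, 1] has entropy 0, so negentropy = -entropy.
    """
    dt = 1.0 / bins
    S = 0.0
    for k in range(bins):
        t = (k + 0.5) * dt
        p = 1 / (2 * tau) if t < tau else 1 / (2 * (1 - tau))
        S -= p * math.log(p) * dt
    return -S

for tau in (0.5, 0.3, 0.8):
    print(tau, min_negentropy(tau), negentropy_piecewise(tau))
```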

We plot truthfulness (as measured by the median) versus consistency (as measured by admissible negentropy), and see more of the differences between how an EA reasons about the candidates. Kasich falls out of the company of Clinton and Sanders; the EA thinks he’s nearly three times less consistent in his statements than Clinton. Cruz’s consistency is on par with both Sanders and Clinton, while Trump beats everyone by a yuge margin in terms of how consistent he is about not being truthful. Sanders and Clinton appear grouped together, but the axes are a bit misleading: Sanders leads in truthfulness by about 3%, and in consistency by about 18%, over Clinton.

Using information-theoretic tools, one can examine how completely rational epistemic agents construct their beliefs out of experiences. In this toy study we saw that the technique can be used to analyze how journalistically derived data concerning the factuality of primary candidates’ statements during a campaign can shed light on the truthfulness and consistency of those candidates. It also shows us that running with simple ad hoc statistics (cough cough) can lead to results that do not survive a more detailed analysis. Speaking of more detail… there are a million ways (information geometry of the candidate public-perception manifold, time-dependent Bayesian updating of EA states, testing the effects of nonuniform priors due to coupling between socioeconomic variables, etc.) to make this analysis more complete, but there are only so many hours in a day. If you have any questions about my methods, please feel free to leave a comment.

# Part IIa: Probability and Information

Claude Shannon gave birth to the field of information theory in his seminal paper A Mathematical Theory of Communication. In it Shannon develops how data is transferred from a source to a receiver over a communication channel. Information in the theory rests in how meaning is conveyed through symbols, making some combinations of symbols appear more often than others. This allowed him to apply the frequentist interpretation of probability to word distributions, and resulted in a rich theory centered around the information measure that mathematical history will forever remember as Shannon entropy:

$$S = -\sum_{i} p_i \log_2 p_i$$

We will eventually get to this measure, but by a different path. Already we have discussed the limitations of the frequentist interpretation of probability, and I have stressed a more general interpretation from the school of Bayes. Shannon’s work is lacking an epistemic touch, and information seems precisely like the thing that we, as thinking beings, should be able to understand through epistemic considerations.

Last time we showed that, under very sensible assumptions, the notion of belief can be quantified in a manner that makes it equivalent to probability theory. This foundation for the Bayesian interpretation of probability theory is the connection between epistemology and information that this post will be about.

Given a Realm of Discourse (RoD) and an epistemic agent (EA), consider the act of the EA interacting with data, i.e. having an experience. If the experience changes the EA, then the experience is said to be informative, and contains information. The thing that changes about the EA is its state: its belief function on the RoD. But how does an experience do that?

Prior to the experience, the EA has a set of beliefs: the prior belief. After the experience the EA has a new set of beliefs: the posterior belief. What distinguishes the latter from the former is the incorporation of the experience into the EA’s history. How does the posterior differ from the prior at a quantitative level?

This is where the rules of probability come in. Consider an arbitrary statement, $s$, and a statement describing an experience, $e$. The belief in their conjunction can be decomposed via Bayes’ rule in two different ways, $b(s\land e|H)=b(s|H)b(e|s\land H)=b(e|H)b(s|e\land H)$. In the former decomposition we can identify the prior, and in the latter the posterior. Rearranging to solve for the posterior we have:

$$b(s|e\land H) = \frac{b(e|s\land H)}{b(e|H)}\, b(s|H) \equiv \mathcal L(s,e|H)\, b(s|H)$$

We have introduced the likelihood function, $\mathcal L$, which transforms the prior into the posterior through simple multiplication of real numbers. Note that the likelihood is a function of both the EA, through its dependence on history, and of the data (experience). Let’s take a closer look at it. We rewrite the denominator as a sum of unmarginalized beliefs:

$$\mathcal L(s,e|H) = \frac{b(e|s\land H)}{\sum_{s'} b(e|s'\land H)\, b(s'|H)}$$
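As a concrete numerical sketch of this update (the three-statement RoD and all belief values below are invented purely for illustration):

```python
# A hypothetical three-statement RoD; e is the experience "the ground is wet".
statements = ["rain", "clouds_only", "clear"]
prior = {"rain": 0.2, "clouds_only": 0.3, "clear": 0.5}        # b(s|H)
b_e_given_s = {"rain": 0.9, "clouds_only": 0.6, "clear": 0.1}  # b(e|s ∧ H)

# The denominator b(e|H): a prior-weighted sum of unmarginalized beliefs.
b_e = sum(prior[s] * b_e_given_s[s] for s in statements)

# Likelihood and update: posterior = likelihood × prior.
likelihood = {s: b_e_given_s[s] / b_e for s in statements}
posterior = {s: likelihood[s] * prior[s] for s in statements}

# The posterior remains normalized even though the likelihood is not.
assert abs(sum(posterior.values()) - 1.0) < 1e-12
```

Belief flows toward the statements that made the experience likely: here the posterior belief in "rain" rises above its prior.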

What does the likelihood mean, and why is it important? We answer the latter by noting that it is that which changes belief, and hence if we are searching for what information is, then here is where we will find it. The numerator of the likelihood is the belief in $e$ under the assumption that $s$ is accepted into the history of the EA. This is where the likelihood gains its name: the numerator is how likely belief in the experience is, given knowledge of the statement. The denominator is a weighted average of the likeliness of belief in the experience given knowledge of every statement. Because this is a weighted average rather than just a sum, the likelihood is NOT normalized. Since the prior and the posterior ARE normalized, the likelihood in a sense describes a flow of belief in the RoD, causing the belief in some statements to wane and others to wax. The likelihood constrains this strengthening and weakening of belief so as to ensure the normalization of belief after the experience, and hence keeps beliefs comparable throughout the process.

Now what of information? If it is that which changes belief, and multiplication by the likelihood updates prior to posterior, then information must be related to the likelihood. For an EA, then, we can write that the information contained in an experience about a statement is $I_s(e) = h[\mathcal L(s,e|H)]$, where the likelihood is for the experience in the context of the EA’s history. We posit continuity for $h$ without justification. We wish to explore this relationship, but to do so we need to think a little harder about how several experiences alter belief in a stepwise fashion. Consider an EA that has two experiences, $e_1\land e_2$. Taken simultaneously as a compound experience we would have:

$$b(s|e_1\land e_2\land H) = \mathcal L(s,\, e_1\land e_2|H)\, b(s|H)$$

If we instead decompose the experiences individually, updating our prior twice, we find:

$$b(s|e_1\land e_2\land H) = \mathcal L(s, e_2|e_1\land H)\,\mathcal L(s, e_1|H)\, b(s|H)$$

This is telling us how the likelihood for a compound experience can be decomposed into consecutive likelihoods. Note that the events have an implicit timestamp associated to them, so changing the order of experiences inherently changes the experiences themselves: order does matter. We look now at the idealized case where the two experiences are independent of one another; by this we mean that they are close enough to one another in time that their timestamps can be switched without altering their nature, resulting in the order independence of the compound event. In this idealization the likelihood factors as:

$$\mathcal L(s,\, e_1\land e_2|H) = \mathcal L(s, e_1|H)\,\mathcal L(s, e_2|H)$$

The independence of the experiences means that information gained about a particular statement from one should be independent of the information gained from the other: $I_s(e_1\land e_2) = I_s(e_1)+I_s(e_2)$. Denoting $x_1=\mathcal L(s,e_1|H)$ and $x_2 = \mathcal L(s,e_2|H)$, independence results in the functional equation:

$$h(x_1 x_2) = h(x_1) + h(x_2)$$

We will now show that this equation has a unique solution, which will give us the functional form relating belief to information. First consider $x_1 = x_2 = x$. One can easily see that for any positive integer $n\in \mathbb Z^+$ we have $h(x^n) = h(x)+\cdots + h(x) = nh(x)$, which we will refer to as the power property. The functional equation must also hold for $x=1$, so it must be the case that $h(1)=0$. This property has a simple interpretation: a likelihood of one leaves the posterior equal to the prior, so it does not change belief, and such an experience contains no information. Next note that $0 = h(1) = h(xx^{-1})=h(x)+h(x^{-1})$, so that $h(x^{-1})=-h(x)$, which extends the power property to negative integers. To extend it to the rationals, note that the power property gives $h(x) = h\big((x^{\frac{1}{m}})^m\big) = mh(x^{\frac{1}{m}})$, so $h(x^{\frac{1}{m}}) = \frac{1}{m}h(x)$, and therefore $h(x^{\frac{n}{m}}) = \frac{n}{m}h(x)$. Going from the rationals to the reals simply involves invoking the denseness of the rationals and the continuity of $h$. Since the power property now applies to all reals, we can differentiate it with impunity with respect to the exponent: $h(x)=h'(x^n)\cdot x^n\ln x$. Changing variables to $u = x^n$, and using $h(u)=nh(x)$ and $\ln u = n\ln x$, this becomes the differential equation $h'(u) = \frac{h(u)}{u\ln u}$, which integrates to $h(x) = C\log x$, where $C$ is a constant of integration. Thus information takes the form:

$$I_s(e) = C\log \mathcal L(s,e|H)$$
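We can check numerically that the logarithmic form does what the derivation demands (a minimal sketch; the constant $C$ and the likelihood values are arbitrary here):

```python
import math

C = -1.0  # arbitrary constant of integration; its sign is pinned down shortly

def h(x):
    """Information as a function of the likelihood: h(x) = C log x."""
    return C * math.log(x)

x1, x2 = 0.7, 2.5  # likelihoods of two independent experiences
assert abs(h(x1 * x2) - (h(x1) + h(x2))) < 1e-12  # additivity
assert abs(h(x1 ** 5) - 5 * h(x1)) < 1e-12        # the power property
assert h(1.0) == 0.0  # a likelihood of 1 changes nothing and carries no information
```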

The constant will be dealt with shortly, but note that it can be partially absorbed into the base of the logarithm, allowing us to express information in whichever base we choose to be convenient. Typical choices include 2 (the unit being called a bit), $e$ (nats), 3 (trits), or 10 (dits). The remaining freedom has to do with the sign of the constant.

The closer to 1 the likelihood is, the less information there is about the statement in the experience. This makes intuitive sense, since likelihoods close to 1 hardly change prior belief. Likelihoods that decrease the prior have negative logarithms, and those greater than 1 have positive logarithms. Hence the sign of the constant of integration depends on properly analyzing the meaning of increasing and decreasing belief. Consider then writing the likelihood in terms of the posterior and prior:

$$\mathcal L(s,e|H) = \frac{b(s|e\land H)}{b(s|H)}$$

This is interesting. The information contained in the likelihood can be written as a difference between the logarithm of the posterior and that of the prior. We now define self-information, $i(s|H) = C\log p(s|H)$. The information contained in an experience is the difference between posterior self-information and prior self-information. It is natural to assume that self-information must be non-negative, which pins down the sign of the constant of integration to being negative (since probabilities are bounded from above by 1):

$$i(s|H) = -\log p(s|H) \ge 0, \qquad I_s(e) = i(s|e\land H) - i(s|H)$$
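In code, with the negative constant chosen so the unit is bits (a sketch; the probability values are made up):

```python
import math

C = -1.0 / math.log(2)  # negative constant; base-2 units, i.e. bits

def self_info(p):
    """Self-information i(s|H) = C log p(s|H); non-negative since p <= 1."""
    return C * math.log(p)

def info(prior, posterior):
    """Information in an experience about s: posterior minus prior self-information."""
    return self_info(posterior) - self_info(prior)

assert abs(self_info(0.5) - 1.0) < 1e-12   # belief of one half carries one bit
assert self_info(1.0) == 0.0               # certainty carries no self-information
assert abs(info(0.25, 0.5) + 1.0) < 1e-12  # doubling belief gives I = -1 bit here
```

Note the sign behavior: under this convention an experience that strengthens belief in $s$ yields negative $I_s(e)$, which is among the issues the next post examines.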

This definition captures the essential property that information in an experience about a statement is related to how much the experience changes an EA’s belief in that statement.  Though it seems we are at the end of our journey in deriving an analytical form for information, it turns out that we are not. In the next post we’ll examine the issues with the above form, and use the consistency of the Realm of Discourse to deal with them and patch things up.

Till then!

# Part I: Belief and Probability

What is Information?

Is there information in a book? Is it in the pages and extracted when it is read? Why don’t two people get the same amount of information from the same book? For that matter, why do I feel informed differently when I read the same book a second time? Or if I read the same book but it’s written in another language? Why do I matter so much when I’m talking about the information contained in a book?

The first lesson of understanding information is that it is contextual.

It’s impossible to disentangle the two aspects of what makes information: the thinking thing, which I will refer to as an epistemic agent (EA), and the interaction of the EA with data, which I will refer to as an experience. An EA changes, it feels informed, when it has an experience that is informative, so information must be related to experiences that change an EA. An experience is interpreted through the lens of the EA, so information must be related to the state of the EA.

The question that must then be addressed is what is it about the EA that changes during an experience? Put more concisely, what is the state of an EA?

We can look to ourselves for guidance in this matter. Whenever I have an experience that I feel is informative, it is because the experience has made me see things in a new light. I believe in certain things more strongly, and in other things less so after the experience. My beliefs about the world, whatever that may be, change when I am informed. The more moved I am by an experience, the more change occurs to my beliefs. At the very least, for the most mundane of experiences, my belief in the statement ‘I have experienced that’ increases, all else remaining the same.

If belief is the key to defining the state of an EA, then the next question is what are these beliefs about? I have already mentioned belief in a particular statement, so let me expand on that in a rather Wittgensteinian fashion. The world is what it is, and EAs have pictures of the world that they believe. It is important to discern between the world itself and the picture of the world. The ontology of the former is important, but it is the epistemology of the latter that we must look to. The picture of the world is painted with language, and statements are the atoms through which the picture is generated. As an EA’s beliefs in statements fluctuate, so does the picture. The picture of the world that an EA has is an ever-shifting mental masterpiece, updated through experiences from the blank slate of a child to the well-colored vision of an adult.

To be more precise, let $\mathcal R = (S,L)$ be a Realm of Discourse (RoD) consisting of a set of statements about the world, $S$, and logical connectives, $L:S^n\rightarrow S$. The latter are n-ary maps from multiple statements to statements. The logical connectives imply that there is a subset of $S$ known as atomic statements, $S_0 \subset S$, which form the most basic things that can be said about the world within the RoD. The remaining statements are called compound statements, for obvious reasons. Some examples of logical connectives, for all $s,s'\in S$, are the unary function of negation, $\neg(s) = \neg s$, and the binary functions of conjunction and disjunction, $\land(s,s')=s\land s'$ and $\lor(s,s')=s\lor s'$. Logical connectives are the rules by which the RoD is constructed, generating all possible statements that could be made about the world by EAs. The key thing to get from the preceding is that an RoD establishes all possible pictures of the world that can be reached, and hence puts limits on what can be said. What is outside of an RoD, an EA cannot speak of.

With all possible pictures of the world in hand, what determines the particular picture of the world that an EA has, and hence what is the state of the EA? We are now back to thinking about belief, and are almost in a position to define it quantitatively. From the RoD, a picture of the world is the degree to which an EA believes in every possible statement. *The sky is blue* and *vaccines cause autism* are examples of statements that an EA has some belief in. Note that we are NEVER talking about the truth value of any particular statement, only the degree to which an EA believes it. The epistemology is ontologically neutral. What the world is is less important than the picture of the world that the EA has. It is the picture of the world that we must examine, to understand how it changes due to experience. Furthermore, the beliefs of an EA are history dependent, in that they are what they are due to the experiences that the EA has had. More on this in a bit.

An interesting aspect of belief, which I alluded to by describing it as a degree, is that it is transitive: if I believe that I am a human being more than that I am an animal, and I believe that I am an animal more than that I am a rock, then necessarily I believe that I am a human being more than that I am a rock. This transitivity of belief is incredibly powerful, because it means that beliefs are both comparable and ordered. I can take any two beliefs I have and compare them, and then I can say which one I believe in more. It may be difficult to compare two beliefs that are very similar, but upon close inspection it appears all but impossible to find a pair that are fundamentally incomparable. These two properties of belief form the first of Cox‘s axioms; Cox’s work has heavily influenced my thinking.

The ordering of belief has a wonderful consequence: we can model the degree to which an EA believes in statements from the RoD by real numbers. A picture of the world is a mapping from the RoD to the continuum, assigning to each statement in the RoD a real number. Recalling the history dependence on past experience, we denote the real number that describes the belief in statement $s\in S$ by $b(s|H)$, where $H$ describes the set of all past experiences relevant to the statement $s$.

Where do the logical connectives of the RoD come into play? This brings us back to the difference between atomic and compound statements. A picture of the world is rational if the belief function on the RoD obeys the second and third of Cox’s axioms: Common Sense and Consistency. Common sense reduces the space of possible relationships between beliefs in atomic and compound statements. Consistency results in a pragmatic rescaling of the RoD map, rendering the common sense relationships in a simpler form. These rescalings are referred to as regraduations of belief, and the end result will look very familiar.

Moving forward, let’s examine what we mean by common sense. Imagine a unary connective, such as negation, acting on a single statement, keeping all else fixed. Common sense dictates that the belief in the transformed statement should somehow be related to the EA’s belief in the original statement:

$$b(\neg s|H) = f\big(b(s|H)\big)$$

Similarly for binary connectives, belief in a compound statement should be related to the beliefs in the statements separately, and dependently on one another:

$$b(s\land s'|H) = g\big(b(s|H),\, b(s'|H),\, b(s|s',H),\, b(s'|s,H)\big)$$

One could go ahead towards arbitrary n-ary connectives, but writing down these relationships becomes quite unwieldy. Fortunately all we need is contained in the relationships $f:\mathbb R\rightarrow\mathbb R$ and $g:\mathbb R^4\rightarrow\mathbb R$. In fact, we can simplify things even further by restricting ourselves to the particular RoD that has negation and conjunction connectives. In this RoD all other logical connectives can be written as combinations of these two (for example, disjunction can be expressed as $s\lor s' = \neg (\neg s \land \neg s')$; learn more at propositional logic). In this RoD the common sense relationship for binary operators is trivial for many combinations of its arguments, and reduces the domain to a 2-dimensional subspace of the original domain. There is still a freedom to choose which two arguments. Given the commutativity of conjunction, either pair, $(b(s|H),b(s'|s,H))$ or $(b(s'|H),b(s|s',H))$, leads to the same results, so we choose the former for simplicity.
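That reduction of disjunction to negation and conjunction can be verified exhaustively over truth values, a trivial check:

```python
from itertools import product

# s ∨ s' is equivalent to ¬(¬s ∧ ¬s') under every truth assignment
for s, sp in product([True, False], repeat=2):
    assert (s or sp) == (not ((not s) and (not sp)))
```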

The application of common sense has led us to necessary relationships between the belief function, which generates the EA’s picture of the world, and the logical connectives the RoD is equipped with, which limit what the EA can speak of. One could have, of course, demanded more complicated relationships, or none at all, but then one would have to argue why my belief in *it is raining* is not at all related to my belief in *it is not raining*. Common sense is epistemically satisfying, and, as will be shown, incredibly powerful in its restrictiveness on the possible forms of belief.

Consistency is the final ingredient that must be incorporated into the quantification of belief. It demands that if there are multiple ways of constructing a statement in the RoD, then the common sense relationships should act in such a way that belief in the statement does not depend on the path used to get to it. For example, I could decompose the compound statement $s\land s'\land s''$ as either $s\land (s'\land s'')$ or $(s\land s')\land s''$. Consistency requires that associatively equivalent decompositions lead to the same belief value. This alone is quite powerful; consider the implications for $g$ due to it. Denote $x = b(s|H)$, $y = b(s'|s,H)$, and $z=b(s''|s,s',H)$. The above mentioned associativity implies:

$$g\big(x,\, g(y,z)\big) = g\big(g(x,y),\, z\big)$$

This is a functional equation, and it is a beast. Functional equations make differential equations look like adorable little cupcakes. Functional equations are beautiful. Solving them, if possible, requires far more dexterity at analysis than attacking other types of equations, and this is but the first functional equation that we will find on our journey towards an epistemic understanding of information.

The solution to the associativity functional equation requires showing the existence of a function $\psi$ that satisfies:

$$\psi\big(g(x,y)\big) = \psi(x)\,\psi(y)$$

The existence proof is long and tedious, but checking that the associative functional equation is satisfied is straightforward. Once one is convinced of this, we can regraduate (rescale) beliefs with $\psi$, $b \rightarrow \psi \circ b$. Why would we do this? The only reasonable explanation is pragmatism. This regraduation transforms the relationship for conjunctions into simple multiplication, yielding:

$$b(s\land s'|H) = b(s|H)\, b(s'|s,H)$$

If you just got a tingly sensation, you’re not alone…

Recalling that $\neg\neg s=s$, consistency produces a second functional equation from the first common sense relationship on unary operators:

$$f\big(f(x)\big) = x$$

This equation is sometimes referred to as Babbage’s equation, and has a long history. One can verify quickly that for any invertible function $\phi$, a solution to this equation is of the form:

$$f(x) = \phi^{-1}\big(1 - \phi(x)\big)$$

Regraduating beliefs once more with this function, $b\rightarrow \phi\circ b$, the unary relationship becomes:

$$b(\neg s|H) = 1 - b(s|H)$$

Let’s discuss these tingly sensations that we’re feeling in a moment. It’s a good idea to note that we have just derived that the belief in a negation is a monotonically decreasing function of the original belief. This means that the more an EA believes in a statement, the less they believe in the negation of the statement. Common sense, right?

One final consideration should be made for the bounds of the regraduated belief function. To find bounds we should consider statements that are purposefully redundant, such as $s=s\land s$. We denote maximal belief by $b_T$, and apply the multiplication rule: $b(s|H) = b(s\land s|H) = b(s|H)b(s|s,H)=b(s|H)b_T$. Since an EA’s belief in a statement that is part of their history is maximal, this implies that $b_T =1$. Furthermore, the negation of a maximal belief is minimal due to the monotonicity of the negation relationship. Denoting minimal belief as $b_F$, one can quickly show that $b_F = 0$. We have then that beliefs are real numbers in the interval $[0,1]$.

The astute reader will notice that the regraduated belief function satisfies two rules which are identical to those found in probability theory: Bayes’ Rule and the Normalization Condition:

$$b(s\land s'|H) = b(s|H)\, b(s'|s,H), \qquad b(s|H) + b(\neg s|H) = 1$$

Seeing this coincidence prompted Cox to wonder, as many have, on the nature of probability. The predominant view of what probability is stems from the frequentist school of thought. Probabilities are frequencies. When I say that a coin has a probability of landing heads of .5, one usually explains this by discussing an experiment. A coin is tossed many times (or an ensemble of identical coins are all tossed at once) and the number of heads is counted and divided by the total number of throws. The resulting number is not necessarily equal to .5, but it’s probably close, and the frequentist will then tell you that if you just performed the experiment an infinite number of times then the frequency of heads would approach one half.

That’s cute, but then a planetologist tells you that the probability of life on Mars is 42%. Do you envision an experiment where a multitude of Universes are created and the planetologist counts the number of them that have a Mars with life in them, and then divides by the total number of Universes that were created? Let’s not forget they have to keep making Universes forever if we want to apply the frequentist definition of probability. Clearly this interpretation cannot handle such a statement.

An opposing school of thought on the matter is that of Bayes. The Bayesian interpretation of probability eschews the use of frequencies, and casts probabilities as parts of the state of knowledge. Here at last we see the connection between the epistemological theory of belief developed by Cox (read his original 1946 paper Probability, Frequency, and Reasonable Expectation) and the Bayesian school of thought. Since belief obeys the same rules as probability, it is no leap to conjecture that $p=b$.

Probabilities ARE beliefs.

When the planetologist tells you that there is a 42% probability of life on Mars, they are telling you, based on their past experiences (education, research, exposure to popular culture, etc.), how strongly they believe life exists on Mars. Their background suits them to have a well thought out belief, and so their statement has weight. Meteorologists do the exact same thing with the weather, which is why sometimes you’ll notice that different weather sites have slightly different probabilities for future weather patterns. These differences reflect the differences in belief that the meteorologists (or, I should say, the models they’re using to analyze meteorological data) have in predicting the future based on the data that they have interacted with.

What has been done here is an exposition of the epistemic foundations of the Bayesian interpretation of probability theory. By grounding the interpretation of probability in an epistemic theory, we can now move forward with our main investigation of what information is. The tools that have been developed will help us in this journey, since now we see how an epistemic agent has an internal state that is defined by a belief function on the Realm of Discourse. This belief function creates a painting of the world which influences how the EA behaves when novel experiences are had. Beliefs can be analyzed via the rules of probability, and, in particular, how they change will lead us to an answer to the question What is Information?

# A Simple Economy

What is a fair economy?

I’ve been thinking about that a lot lately, and decided to model a simple economy to try to understand how economic transactions affect the distribution of wealth in a society. I’m not going to be addressing the big questions of policy here, I’ll simply be looking at how a mass of people exchange money to see if there are equilibrium distributions and how fair those distributions are.

First off let’s review a basic metric that tries to capture how much economic equality there is in a society, the Gini Index. Take a group of people and their individual wealth, and plot the percentage of the population versus the percentage of wealth. If the group has an equal distribution of wealth then 10% of the population will have 10% of the wealth, 60% of the population will have 60% of the wealth, etc. That means the plot will be a straight line, the Line of Equality:

If we do this for a group of people in a state, or country, or continent, we get a curve that lies below the Line of Equality, the Lorenz Curve:

Twice the area between these two curves is the Gini Index, and it gives us a measure of inequality. The more area there is, the more wealth is concentrated in a small fraction of the population. The less area there is, the more evenly distributed the wealth is. A Gini Index of 0 is perfect equality, whereas an index of 1 is the case where one person has the wealth of the entire population. The World Bank tries to calculate this quantity for every country, and UConn has a great site displaying the indices for all the counties in the US.
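The construction above translates directly into code. Here is a minimal sketch of a Gini Index estimate from a list of individual wealths, using a simple large-sample approximation of the area between the curves:

```python
import numpy as np

def gini(wealth):
    """Estimate the Gini Index as twice the area between the Line of
    Equality and the empirical Lorenz curve."""
    w = np.sort(np.asarray(wealth, dtype=float))
    lorenz = np.cumsum(w) / w.sum()           # cumulative share of wealth
    pop = np.arange(1, len(w) + 1) / len(w)   # cumulative share of population
    return 2 * np.mean(pop - lorenz)

print(gini([1, 1, 1, 1]))  # → 0.0, perfect equality
```

With one individual holding everything, the estimate approaches 1 as the population grows.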

I wanted to see what happens to the Gini Index in a simple economy which I decided to model. I took a population of N individuals (I did experiments with economies ranging from a thousand to a million individuals), and gave each individual the same amount of wealth. I then simulated the economy by having individuals randomly pair up with one another, and, with some probability, have an economic transaction. The economic transaction consists of one of the individuals, a payer, giving one unit of wealth to the other individual, a payee (presumably in exchange for some activity). Debt is not allowed, so if the payer does not have one unit of wealth, then the transaction is cancelled. This model has a conservation of wealth, in that the sum of wealth of all individuals does not change.
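Here is a minimal sketch of this model (the population size, initial wealth, and step count are scaled well down from the experiments described here so that it runs quickly):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n_agents=100, initial_wealth=5, steps=20_000):
    """Pair two random individuals per step; the payer gives one unit of
    wealth to the payee, unless the payer is broke (no debt allowed)."""
    wealth = np.full(n_agents, initial_wealth)
    for _ in range(steps):
        payer, payee = rng.choice(n_agents, size=2, replace=False)
        if wealth[payer] > 0:
            wealth[payer] -= 1
            wealth[payee] += 1
    return wealth

wealth = simulate()
assert wealth.sum() == 100 * 5  # total wealth is conserved
```

Histogramming `wealth` after many steps reproduces the exponential shape described below; conservation holds because every transaction merely moves one unit.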

Here’s how an economy of 100,000 individuals, where at every timestep there is a 4% chance of an economic interaction between at least one pair of individuals, evolves over time:

I let this economy run for years and years, and watched the distribution of wealth evolve. The Gini Index seems to stabilize around .5, and the wealth settles into an exponential distribution.

These results are consistent with a MaxEnt analysis. In MaxEnt we write down an entropy functional which captures the constraints we have for the system (the total number of individuals, and the total wealth). Maximizing this functional leads to the distribution that is as ignorant as it can be while still satisfying the constraints.

Consider $M$ units of wealth in a population of $N$ people in a continuum limit. Denote the fraction of people that have wealth between $m$ and $m+dm$ by $n(m)dm$. The entropy functional is:

$$S[n] = -\int_0^\infty dm\, n(m)\ln n(m) - \alpha\left(\int_0^\infty dm\, n(m) - 1\right) - \beta\left(\int_0^\infty dm\, m\, n(m) - \frac{M}{N}\right)$$

Since we are dealing with a continuous distribution, the first term is the (negative) Kullback–Leibler divergence between $n(m)$ and a uniform distribution. The second term is the normalization constraint for the population, while the third is the constraint on the average wealth, $M/N$. $\alpha$ and $\beta$ are Lagrange multipliers that enforce the constraints. Moving on, the population fraction and wealth fraction are, respectively:

$$x(m) = \int_0^m dm'\, n(m'), \qquad y(m) = \frac{N}{M}\int_0^m dm'\, m'\, n(m')$$

Maximizing the entropy functional with respect to the two Lagrange multipliers gives us back the constraints, while maximizing with respect to the distribution gives us the exponential form found in the numerical experiment:

$$n(m) = \frac{N}{M}\, e^{-\frac{N}{M}m}$$

where we have used the constraints to solve for the Lagrange multipliers. Plugging this back into the equations for the population and wealth fractions, and eliminating the wealth variable, $m$, one can flex their algebraic muscles to show that:

$$y = x + (1-x)\ln(1-x)$$

This is the functional form of the Lorenz Curve, and it can be used to extract the Gini Index:

$$G = 2\int_0^1 dx\, \big[x - y(x)\big] = -2\int_0^1 dx\, (1-x)\ln(1-x) = \frac{1}{2}$$

So the equilibrium distribution in the continuum limit (infinite population and infinite wealth) is an exponential, and its Gini Index is .5.
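As a quick Monte Carlo cross-check of this result (a sketch; the scale parameter is the average wealth, which drops out of the index), we can sample an exponential distribution and estimate its Gini Index directly:

```python
import numpy as np

rng = np.random.default_rng(1)
w = np.sort(rng.exponential(scale=1.0, size=200_000))  # equilibrium wealth samples

lorenz = np.cumsum(w) / w.sum()            # cumulative wealth share
pop = np.arange(1, len(w) + 1) / len(w)    # cumulative population share
g = 2 * np.mean(pop - lorenz)              # ≈ .5 for the exponential distribution
```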

This is an interesting result since it is halfway between an equal distribution of wealth, and the most unequal distribution of wealth. Given the symmetric nature of the economic transactions (no benefit is gained by having more or less money), this economy seems pretty fair, yet the resulting wealth distribution is quite unequal. So what’s going on here?

First of all, it must be stressed that individuals are performing a random walk, so the distribution represents the amount of time that a particular individual will spend (on average) with a particular value of wealth. This is made quite evident if we follow several individuals’ wealth in a simulation:

Each individual’s wealth jumps around quite a bit. The Gini Index is not capturing this particular aspect of wealth, and the question of whether we can construct a better metric for measuring wealth disparity is one I may get back to in the future. It should also be noted that this model is equivalent to that of a gas consisting of atoms that have kinetic energy and are allowed to exchange energy in completely elastic collisions. The economic agents are the analog of the atoms, and wealth is the analog of kinetic energy. The exponential distribution seen here is analogous to the Boltzmann distribution, with the average wealth playing the role of temperature.

A major drawback of this model is that we haven’t incorporated any interesting dynamics such as debt, inflation, asymmetric transactions, or more complex financial entities. Some of these have been explored in papers by Yakovenko and Caticha, to name a few. Since economics is not a zero-sum game, and wealth can be generated by both parties during a transaction, it would be interesting to see how any of these types of dynamics end up affecting the equilibrium wealth distribution and consequently the Gini Index.

Hopefully in a future post I’ll explore this model a little more, and incorporate some of the aforementioned aspects of more realistic economies. We’ll see how much time I can actually squeeze into this econophysics side project… till next time!