Entropic Compression and English


Let’s say that we want to send a message to a friend. What’s the smallest amount of information we need to send that message? The answer is related to the entropy of the alphabet we use to write the message. This post will cover exactly how that works.

First some definitions. A letter is a symbol, which we denote a_i, used to write messages. The subscript runs over all the letters in the alphabet being used, \mathcal A = \{a_i\}_{i=1}^N, so the alphabet contains N letters. A message consists of a string of letters, \mathcal M = a_{i_1}a_{i_2}\cdots a_{i_M}, and its size is M letters. To convey the message we need to store each letter and send it, so the way in which we store letters determines the amount of storage we need to send the message.

Let’s first think through what we mean by storage. Since the alphabet has N letters, each letter can be stored in either a single N-it, or in \log_2 N bits. Think of the latter as the number of yes/no questions one needs to answer to pinpoint which element of the alphabet a particular letter is (if the logarithm is fractional then we need to round up). Hence the entire message needs M \log_2 N bits to be sent. Is this the best we can do? Obviously not, otherwise this post wouldn’t exist.

The key to compressing the message is to use the entropy of the alphabet: the lower the entropy, the more compression we can achieve. The idea is to store the letters of the alphabet so that the most common letters use up less storage (fewer bits), at the expense of rare letters using more. This means that certain messages, when stored under this new scheme, will actually use up MORE storage. However, such messages are rare, and on average messages will take up less storage.

So, how is this new storage scheme accomplished?

First we need to find the probability of appearance of each letter, p_i = p(a_i) . This can be done by looking at a corpus of messages that have already been sent to see which letters are used more often than others, or through the use of an epistemic agent that starts off ignorant and observes messages, refining its belief distribution for the alphabet. Once the probabilities are known, the information content in each letter is:

I_i = \log_2\frac{1}{p_i} = -\log_2 p_i

The ceiling of this value is the number of bits entropic compression requires to store that particular letter. Hence, each letter can be mapped to a code word that is that many bits long. So what is the average number of bits needed to store a particular letter?

\langle I\rangle = \sum_{i=1}^N p_i\log_2\frac{1}{p_i} = -\sum_{i=1}^N p_i\log_2 p_i = S[\mathcal A]

Voila! The entropy of the alphabet coincides with the average number of bits needed to store a letter. Since the entropy is maximized for an equiprobable alphabet at S^{max} = \log_2 N, this scheme will use no more space than the original. To see this we consider long messages: the total information needed to transmit them will be M S[\mathcal A] \leq M\log_2 N. Now, there is a small caveat here, since we did use the ceiling function, but we won’t worry about that.
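
To make the bookkeeping concrete, here is a minimal Python sketch of the calculation above, using a made-up four-letter alphabet rather than the English frequencies discussed below:

```python
import math

# A made-up four-letter alphabet; the Britannica frequencies come later in the post.
probs = {"a": 0.5, "b": 0.25, "c": 0.15, "d": 0.10}

# Information content of each letter, I_i = -log2(p_i), and the code-word length
# entropic compression would assign to it (the ceiling of I_i).
for letter, p in probs.items():
    info = -math.log2(p)
    print(f"{letter}: {info:.2f} bits -> {math.ceil(info)}-bit code word")

# The average information per letter is the entropy of the alphabet.
entropy = -sum(p * math.log2(p) for p in probs.values())
print(f"S[A] = {entropy:.2f} bits per letter (vs. log2(4) = 2 for equal probabilities)")
```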

So what does this mean for messages sent in English? Assume an alphabet of only 26 letters (no distinction between upper and lower case, and no spaces). The average number of bits needed to convey a letter is \log_2 26\approx 4.70. But we aren’t writing just any messages, we’re writing messages in English. I downloaded the 11th edition of the Encyclopaedia Britannica, which gave me a dataset of just over 40 million words (40,350,587 of them). Let’s take some time to discuss it, first starting with letters, and then moving up to clusters of letters and words.

Parsing up all those words into their constituent letters gave me 192,439,482 instances of letters in the English language. How often does each letter occur?

[Figure: letter frequencies in the Britannica corpus]

As you can see ‘E’ is the winner, and vowels show up pretty often. Q’s and Z’s do not. We can calculate the entropy of English based off of this letter distribution, and find that the average information needed to store a letter is S[\mathcal A] \approx 4.17. Hence this scheme can compress messages by about 11%. Now, actually implementing this scheme requires figuring out the code words to map each of the letters to. This can be accomplished quite elegantly using a Huffman tree, which also gives entropic compression its standard name: Huffman coding.
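
A minimal sketch of how such a code can be built with a heap, using made-up letter counts rather than the Britannica frequencies:

```python
import heapq

def huffman_code(freqs):
    """Build a Huffman code (letter -> bit string) from a frequency table."""
    # Each heap entry: (frequency, tie-breaker, {letter: code-so-far}).
    heap = [(f, i, {ch: ""}) for i, (ch, f) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        # Prepend a 0 to every code in the lighter subtree, a 1 in the heavier.
        merged = {ch: "0" + code for ch, code in c1.items()}
        merged.update({ch: "1" + code for ch, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, count, merged))
        count += 1
    return heap[0][2]

# Made-up counts for illustration.
code = huffman_code({"e": 390, "t": 280, "a": 250, "q": 3, "z": 2})
print(code)  # frequent letters get short code words, rare letters long ones
```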

It should also be mentioned that this compression scheme provides the maximum compression achievable given the constraint imposed by the letter frequencies on which we trained the epistemic agent whose beliefs we’re using. This is because the entropy maximized under the constraint corresponds to the most ignorant belief function allowed: lowering the entropy implies that the epistemic agent is not maximally ignorant. But if the EA is not maximally ignorant beyond the information given to it by the constraint, then there must exist some other information (constraint) that it is using to constrain its beliefs. Since, by assumption, the frequency constraint is all the information the EA has to work with, a lower entropy implies a contradiction.

Now, this compression is great, but can we add more constraints from the English language to make it even better? What sort of constraint should we add?

If you are given a word to guess, each of the first few letters might take around 4-5 tries. That’s what the entropy means: you’ll need to ask, on average, 4.17 questions to get the right letter. However, once you have the first one or two letters you’ll be able to guess the remaining letters pretty quickly. If you see a ‘q’ you’re going to guess ‘u’ is the next letter, and you’ll most likely be right. That’s because English has much more structure than just the frequency of letters: some letters tend to follow other letters more often than others. This following preference can help us construct an even better compression algorithm!

Once again I took the Encyclopaedia Britannica and did an analysis of which letters follow each other. I got the frequencies for all pairs of letters in words, and then calculated the conditional probabilities via Bayes’ Rule:

p(a_j\,|\,a_i) = \frac{p(a_i a_j)}{p(a_i)} = \frac{p(a_i a_j)}{\sum_k p(a_i a_k)}
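
As a rough illustration of this estimate, here is a sketch that tallies adjacent letter pairs in a tiny, made-up word list and normalizes each row to get the 1-transition probabilities:

```python
from collections import Counter

def transition_probs(words):
    """Estimate p(next letter | current letter) from adjacent pairs in words."""
    pair_counts = Counter()
    for word in words:
        for a, b in zip(word, word[1:]):
            pair_counts[(a, b)] += 1
    # p(b | a) = count(a, b) / sum over k of count(a, k)
    totals = Counter()
    for (a, _b), n in pair_counts.items():
        totals[a] += n
    return {(a, b): n / totals[a] for (a, b), n in pair_counts.items()}

# Tiny illustrative corpus; the post uses the full Encyclopaedia Britannica.
probs = transition_probs(["queen", "quote", "quiet"])
print(probs[("q", "u")])  # 1.0: 'q' is always followed by 'u' here
```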

The conditional probabilities are referred to as 1-transitions, since they tell you what the probability is of transitioning to a new letter given that you have 1 letter of knowledge. The results are pretty neat:

[Figure: 1-transition probabilities, p(next letter | current letter), for all letter pairs]

The scale is logarithmic, so the colors go in powers of 2, and each row is normalized to sum to one. The difference between red and blue is 16, so red is over 64,000 (2^16) times as likely as blue. Note the patterns in this plot. Consonants love to be followed by vowels. ‘B’ loves to be followed by ‘Y’ and ‘E’. ‘Q’ is almost always followed by ‘U’. There are some weird ones in there too. Why is ‘Q’ sometimes followed by ‘V’? That’s probably because the Encyclopedia’s digital version hasn’t been fully edited. The original was scanned and turned into letters based on the scanned shapes, and older books (this one is from 1911) had u’s printed as v’s.

Neat, so now we have transition probabilities. How picky are the transitions for different letters? That can be answered by looking at the entropy of each of these distributions:

[Figure: entropy of each letter’s transition distribution, compared with the entropy of the alphabet]

The dashed green line is the entropy of the alphabet. Notice that all the transition entropies are lower, meaning that if you know a letter, it takes fewer guesses to find the next letter. For letters on the right side of the plot, far fewer guesses are needed. The letters on the left side are less picky about their followers, and more guesses are needed. We can perform a Huffman encoding on each of these distributions, and further compress our English messages. As long as the decoder has this list of lists of code words, they will be able to decipher what we send them.

So, how much have we gained by looking at next letters?

The answer has to do with the conditional entropy of the joint distribution of two letters. Put plainly, the entropy of two letters minus the entropy of the first letter gives us the average number of questions we need to ask to guess the second letter if we know the first (Translation: (how many questions do you need to guess the pair of letters) – (how many questions you need to guess the first letter)):

S[\mathcal A|\mathcal A] = S[\mathcal A\times\mathcal A] - S[\mathcal A]

Calculating this entropy we get S[\mathcal A|\mathcal A] \approx 3.29 , or a 30% compression from the original. We’ve almost doubled our compression by considering which letters are more likely to follow which letters.
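
A sketch of this difference-of-entropies calculation, again on a toy word list rather than the full corpus:

```python
import math
from collections import Counter

def entropy(counts):
    total = sum(counts.values())
    return -sum(n / total * math.log2(n / total) for n in counts.values())

def conditional_entropy(words):
    """S[A|A] = S[A x A] - S[A], estimated from adjacent letter pairs."""
    pairs = Counter((a, b) for w in words for a, b in zip(w, w[1:]))
    firsts = Counter()
    for (a, _b), n in pairs.items():
        firsts[a] += n
    return entropy(pairs) - entropy(firsts)

# Toy corpus; the post's Britannica data gives roughly 3.29 bits.
print(conditional_entropy(["the", "then", "there", "queen", "quiet"]))
```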

But why stop there?

Why not look at the probability that a letter follows a pair of letters: the 2-transition probabilities? Or the 3-transition probabilities, etc. Doing this for the data set, I found that English has  S[\mathcal A|\mathcal A\times\mathcal A] \approx 2.82 , S[\mathcal A|\mathcal A\times\mathcal A\times\mathcal A] \approx 2.03 , S[\mathcal A|\mathcal A\times\mathcal A\times\mathcal A\times\mathcal A] \approx 1.19 . If we plot the entropies based on the number of constraints we’re using, we find something amazing:

[Figure: conditional entropy versus the number of prior letters used, showing a linear trend toward zero]

The line at an entropy of 1 corresponds to needing only one yes/no question, on average, to get the next letter right. We clearly see a linear trend in the entropies approaching zero. If the trend continues, then we expect perfect guessing of what the next letter will be somewhere around 6 priors. So what does this mean for compression? Well, it means we need to spend the most bits (on average) on the first letter of each word in a message, and fewer bits on each consecutive letter in that word. We shouldn’t even send the letters beyond the 7th in long words, because they’ll be easily filled in by the decoder (most of the time). Recall that with Huffman encoding we were able to reduce an M-letter message to M S[\mathcal A] \approx M\times 4.17 bits. What is the average storage we need now?

To answer that we really need the word distribution of English: How many words of each length appear in the language? There’s a nuance here, in that we are interested in the usage of the language, and not the corpus of words that make up the language. The latter is over all possible unique words, whereas the former includes repetition. We can plot both distributions:

[Figure: word-length distributions for unique words in the corpus and for words as actually used]

This is a cool diagram because it tells us that the distribution of used words pulls the average letter count per word down quite a bit. Treating words as unique entities, the mean word length of English is 8.37 letters per word, and 50% of words have fewer than 7.46 letters in them. Treating words as non-unique elements of messages, the mean word length is 4.78 letters per word, and 50% of words have fewer than 3.55 letters in them. The distributions are equal at 5.19 letters per word, meaning that English words with more than 5 letters are underused, while words with 5 or fewer letters are overused.

Since we’re dealing with how English is being used to make messages, we should use the blue distribution to calculate the average entropy of a word. We do this by averaging the entropy needed to store a word of length l over all possible word lengths:

S_{\rm word} = \sum_l p(l)\, S_l, \qquad S_l = S[\mathcal A] + \sum_{k=1}^{l-1} S[\mathcal A\,|\,\mathcal A^{\times k}]

The result is pretty amazing.

How amazing?

Consider the original scheme of storing each letter of the alphabet using the same amount of storage. The average word entropy would be about 22.42 bits. With entropic encoding, the average word entropy that results from the above is 11.44 bits. We’ve achieved a compression of about 50%, which basically allows us to send twice as many messages in the same amount of storage. Pretty neat, huh?
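
For the curious, here is a rough sketch of the averaging in that last step. The per-position entropies are the measured values quoted above, letters beyond the fifth are assumed to cost essentially nothing (following the extrapolated trend), and the word-length probabilities are made up for illustration rather than taken from the blue distribution:

```python
# Measured per-position entropies from the post: the first letter, then letters
# conditioned on 1, 2, 3, and 4 previous letters.
position_bits = [4.17, 3.29, 2.82, 2.03, 1.19]

def word_bits(length):
    """Bits to store a word of the given length, assuming letters beyond the
    fifth cost ~0 bits (the extrapolated trend above)."""
    return sum(position_bits[:length])

# Hypothetical word-length probabilities (NOT the measured blue distribution).
p_len = {2: 0.15, 3: 0.25, 4: 0.25, 5: 0.20, 6: 0.10, 7: 0.05}
average = sum(p * word_bits(length) for length, p in p_len.items())
print(f"roughly {average:.1f} bits per word under these assumptions")
```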

 


The Ising Model: A Graph Theoretic Introduction


In this post I will introduce the Ising model from a graph theoretic point of view, without resorting to a particular graph structure on which the system lives. I will explore more specific systems in a future post.

Consider an undirected graph G=\langle V,E \rangle where V is a set of vertices, and E is a set of edges. A state of the graph, \sigma_G, is a mapping of the vertices, \sigma: V\rightarrow \{-1,1\}: v_i \mapsto \sigma_i, and the edges, \sigma: E\rightarrow \{-1,1\}: e_{ij}\mapsto \sigma_{ij} = \sigma_i\sigma_j, where i and j are the vertices that bound the edge. This definition makes edges that bound vertices of the same state have a state of +1, and edges that bound vertices of opposite states have a state of -1.

Since each vertex can be labeled in one of two ways, and the state of the edges is fully determined by the state of the vertices, there are a total of 2^{|V|} possible states for the graph, where |\cdot| is the cardinality operator. We bundle these together and write \Sigma_G = \{\sigma_G\}.

We want to compare different states to one another, and the easiest way to do this is to define a Hamiltonian. A Hamiltonian is a function that maps states to real numbers, which we call energies. It consists of a piece that acts on the edges, and a piece that acts on the vertices:

H[\sigma_G] = \sum_{v_i\in V} f_0(\sigma_i) + \sum_{e_{ij}\in E} f_1(\sigma_{ij})

where the f’s are coupling functions. In the simplest case we choose f_0(x)=-hx, and f_1(x) = -Jx. In the physics parlance, h is referred to as the magnetic field, and J the coupling. Plugging in the definitions for the state of vertices and edges, the form of the Hamiltonian that we will be examining reads:

H[\sigma_G] = -h\sum_{v_i\in V}\sigma_i - J\sum_{e_{ij}\in E}\sigma_i\sigma_j
= -h\sum_{v_i\in V}\sigma_i - \frac{J}{2}\sum_{v_i\in V}\sum_{v_j\in\partial v_i}\sigma_i\sigma_j

where the second sum on the first line is over unique pairs of vertices, and on the second line over the neighbors of the ith vertex. The factor of a half comes from counting each edge twice in the complete expression. The reason for the negative sign on the coupling term is that it ranks states that are more homogeneous lower, so we can say that homogeneity decreases energy.
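
A minimal sketch of this Hamiltonian on an explicit graph, using the sign conventions above (h the field, J the coupling); the 4-cycle here is just a convenient toy example:

```python
def hamiltonian(edges, spins, h=0.0, J=1.0):
    """H = -h * sum_i s_i - J * sum over edges of s_i * s_j."""
    vertex_term = -h * sum(spins.values())
    edge_term = -J * sum(spins[i] * spins[j] for i, j in edges)
    return vertex_term + edge_term

# A 4-cycle: vertices 0-3, each connected to its two neighbors.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(hamiltonian(edges, {0: 1, 1: 1, 2: 1, 3: 1}))    # -4.0: homogeneous, low energy
print(hamiltonian(edges, {0: 1, 1: -1, 2: 1, 3: -1}))  # +4.0: maximally inhomogeneous
```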

We can make the notion of homogeneity more precise by introducing an order parameter, called the magnetization, describing the average vertex state:

m[\sigma_G] = \frac{1}{|V|}\sum_{v_i\in V}\sigma_i

Note that this is a number between -1 and 1, inclusive. There are many states that have the same order parameter. So many, in fact, that it’s best to talk about the logarithm of their number, which turns out to be (for very large graphs):

\log\Omega(m) \approx -|V|\left[\frac{1+m}{2}\log\frac{1+m}{2} + \frac{1-m}{2}\log\frac{1-m}{2}\right]

Here I used Stirling’s approximation for the factorials in the binomial coefficient. This function is sharply peaked around zero magnetization, especially for large graphs. The majority of the states have little or no magnetization; in fact, if we expand around zero magnetization we see that the states are distributed as a Gaussian:

\Omega(m) \approx 2^{|V|}\exp\left[-|V|\left(\frac{m^2}{2} + \frac{m^4}{12}\right)\right]

Here I have included the quartic term in the Gaussian, which can be neglected. To see this we plot the distribution below for graphs with 10, 100, and 1000 vertices. We see how quickly the distribution becomes sharply peaked at vanishing magnetization:

[Figure: distribution of states over magnetization for graphs with 10, 100, and 1000 vertices]

What does this mean?

We imagine that the system has some sort of dynamics: that it can transition from one state to another state. Under no other constraints, the system will soon find itself in a state with little or no magnetization.

For something more interesting to occur, the dynamics of how the system transitions from one state to another must be constrained by more than simply the available volume of phase space. Here is where we finally get to the physics of the Ising model. The constraint we need to incorporate is that the energy of the system is constant, so that we can say what the probability of each state is: p(\{\sigma\}) . We resort to a MaxEnt analysis: the distribution of states under the constraint of constant energy is the one that maximizes the Shannon entropy of the system:

S[p] = -\sum_{\sigma_G\in\Sigma_G} p(\sigma_G)\log p(\sigma_G) - \lambda\left(\sum_{\sigma_G} p(\sigma_G) - 1\right) - \beta\left(\sum_{\sigma_G} p(\sigma_G)\,H[\sigma_G] - E\right)

Two Lagrange multipliers have been introduced to enforce the normalization and energy constraints on the system. The MaxEnt method produces the belief distribution that an epistemic agent should have about the system knowing nothing more than the energy constraint. Any deviation from this distribution implies that the EA has more information at their disposal. After a little bit of calculus, and a little redefinition of the Lagrange multipliers, one can find:

p(\sigma_G) = \frac{e^{-\beta H[\sigma_G]}}{Z}, \qquad Z = \sum_{\sigma_G\in\Sigma_G} e^{-\beta H[\sigma_G]}

This is just the standard Boltzmann distribution, and Z is called the partition function (it enforces the normalization constraint).
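
Reusing the hamiltonian helper and the 4-cycle from the sketch above, the Boltzmann distribution can be brute-forced for any graph small enough to enumerate; this is only feasible for toy graphs, since the number of states grows as 2^|V|:

```python
import itertools
import math

def boltzmann(edges, vertices, beta=1.0, h=0.0, J=1.0):
    """Brute-force p(state) = exp(-beta * H) / Z over all 2^|V| graph states."""
    states = [dict(zip(vertices, s))
              for s in itertools.product([-1, 1], repeat=len(vertices))]
    weights = [math.exp(-beta * hamiltonian(edges, s, h, J)) for s in states]
    Z = sum(weights)
    return [(s, w / Z) for s, w in zip(states, weights)]

# The two fully aligned states of the 4-cycle carry most of the belief at beta = 1.
for state, p in sorted(boltzmann(edges, [0, 1, 2, 3]), key=lambda t: -t[1])[:3]:
    print(state, round(p, 3))
```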

I do want to discuss a bit the assumptions that went into this derivation. The actual energy of our system is the Hamiltonian, but since an epistemic agent cannot know the actual state of the system, it must resort to averaging over all possible states to define the energy. If the actual system fluctuates a bit, this means that the actual energy is a time average of the energies the system fluctuates through. The statement that this temporal average is equivalent to the average over the ensemble of all possible states is known as the ergodic hypothesis. This is only possible because we assume the energy is constant (in the temporal average), which means that the system is in equilibrium. If the system were out of equilibrium, the ergodic hypothesis would no longer hold, and the epistemic agent would not have the belief distribution about the system derived above. Out-of-equilibrium dynamics are much more difficult to analyze, and we will not do so here.

The partition function turns out to be the most important quantity to analyze. Unfortunately it also happens to be notoriously difficult to calculate. Recalling the definition of the magnetization, we can rewrite it as:

Z = \sum_m e^{\beta h|V|m}\sum_{\sigma_G\,:\,m[\sigma_G]=m} e^{\beta J\sum_{e_{ij}\in E}\sigma_i\sigma_j}

This functional form is a sum over all possible magnetizations, followed by a sum over all states that have that magnetization. One thing we can instantly see from this is that the magnetic field, h, is acting as a bias. It suppresses belief in states that do not have a magnetization that is the same sign as the magnetic field. This is the reason why these quantities have the names they do: the magnetic field magnetizes the system.

With this machinery at hand we can now introduce dynamics. We initialize our system in any state we wish and then allow for transitions to new states: we let the system evolve in time. We use the probability of a graph state to find the probability that a single vertex transitions by simply holding all other vertices constant:

p(\sigma_i\to-\sigma_i\,|\,\{\sigma_{j\neq i}\}) = \frac{e^{-\beta\sigma_i\left(h + J\sum_{v_j\in\partial v_i}\sigma_j\right)}}{2\cosh\left[\beta\left(h + J\sum_{v_j\in\partial v_i}\sigma_j\right)\right]}

At each timestep we pick some fraction of the vertices and calculate the probability that they change their state. With that probability we allow those vertices to either change or stay the same.
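
A sketch of one way to implement such an update: resampling a vertex from its conditional distribution given its neighbors (which is equivalent to flipping it with the probability above); the graph, parameters, and update fraction are arbitrary choices for illustration:

```python
import math
import random

def update_vertex(spins, neighbors, i, beta=1.0, h=0.0, J=1.0):
    """Resample vertex i from p(sigma_i | all other vertices held fixed)."""
    local_field = h + J * sum(spins[j] for j in neighbors[i])
    p_up = 1.0 / (1.0 + math.exp(-2.0 * beta * local_field))  # p(sigma_i = +1 | rest)
    spins[i] = 1 if random.random() < p_up else -1

# The 4-cycle again; start from a random state and update half the vertices per step.
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
spins = {i: random.choice([-1, 1]) for i in neighbors}
for _ in range(100):
    for i in random.sample(list(neighbors), k=2):
        update_vertex(spins, neighbors, i, beta=1.0)
print(spins)  # at low enough temperature the spins tend to align
```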

I’ve been running a lot of these types of simulations on scale-free networks. Many networks in reality, including the Internet and social networks, have been shown to be scale free. I will write an upcoming post on how to actually create these types of networks, amongst others, but for this post I wanted to focus on the Ising model on graphs. Here’s a simulation of all that I’ve been describing:

Ising Dynamics on a Scale Free Network

Truthfulness and Perception in the 2016 Primaries


Recently I came across an article on Blue Nation Review claiming that Hillary Clinton is the most truthful of the candidates, based on statistics from Politifact‘s Truth-O-Meter. Sure, she has the largest percentage of True statements, but a measure of truthfulness should draw on the full distribution of statements that a candidate has spoken, not just a particular subset. Clinton has also spoken partial truths, half truths, falsehoods, and racked up two cases of “Liar, liar, pants on fire!“ What other information is lurking in the full distributions that is lost in BNR’s simplistic analysis? Being enamored of information theory and epistemology, I decided to look at Clinton, Cruz, Kasich, Sanders, and Trump‘s Truth-O-Meters a little more closely.

To understand this analysis, imagine an idealized thinking being, an epistemic agent (EA), which is initially completely ignorant of the truthfulness of candidates: the EA has no preference for believing what a candidate says when they say it. The EA is then exposed to statements by the candidates, and uses its fact-checking facility (the journalists at Politifact) to inform it of the truthfulness of those statements. As it learns more, its preferences shift. Analyzing more statements moves the EA from its tabula rasa starting state to a biased one where there is a preference to believe or not believe a candidate. The more statements the EA hears, the less the belief function changes, so that after enough statements the EA has a pretty definite belief in the truthfulness of the candidates. We begin then with an ignorant EA, which opens up the Politifact site and begins reading the analyzed statements made by the candidates in the 2016 primary election.

numberOfStatements.png
Number of fact checks that have been done on each candidate by Politifact’s Truth-O-Meter Journalists.

The number of statements analyzed for each candidate varies (above). Kasich and Sanders have been the least scrutinized (61 and 75 statements, respectively), while Hillary has been the most scrutinized (174 statements) by the site. Because of the differing amounts of data informing the EA, its belief in Clinton‘s truthfulness will be the least malleable, while in the case of Kasich it will be the most malleable. A new datum from Clinton will not sway the EA’s opinion of her as much as a new datum would sway its opinion of Kasich or Sanders. Cruz and Trump are in between these extremes. We call the EA’s initial beliefs about the candidates priors, and the beliefs after assimilating all the data on the candidates’ Truth-O-Meters the posteriors. You can read more about how epistemic states of knowledge change because of experiences here and here.

The posterior belief coming from an uninformative prior matches the normalized frequencies. Truth-O-Meter uses a six point scale, but in common speech truthfulness is a continuous variable due to its dependence on not just facts, but context, intensity of language, etc. The closer to 0, the more false a statement is; the closer to 1, the more true it is (recall that these terms are always with respect to the fact checking facility of the EA). Interpolation to the continuum belief distribution was accomplished by using a cubic Hermite spline, as described here. This choice does an excellent job preserving the shape of the discrete frequency distribution. Below we compare the discrete frequencies with the credence/belief distributions of the EA. The area under a curve represents the probability that a statement will have a particular range of truthfulness; the total area under any of these curves is unity.

credences.png
The frequency vs. truthfulness of the repertoire of statements said by each candidate.  Truthfulness of 0 corresponds to Truth-O-Meter’s “Liar, Liar, Pants on Fire” category, and 1 is “True.”
pdfs
The continuous credence(belief) function of an epistemic agent for each of the candidates. These are interpolated from the frequencies above by using a cubic spline.
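
A minimal sketch of this kind of shape-preserving interpolation using SciPy’s PchipInterpolator (a monotone cubic Hermite spline); the category frequencies below are invented for illustration and are not the Politifact numbers:

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

# Six Truth-O-Meter categories mapped onto truthfulness values in [0, 1].
truthfulness = np.linspace(0.0, 1.0, 6)
# Invented normalized frequencies for one candidate (not real Politifact data).
freqs = np.array([0.05, 0.10, 0.15, 0.25, 0.30, 0.15])

# A shape-preserving cubic Hermite spline avoids the overshoot a generic cubic
# spline can introduce between the discrete points.
credence = PchipInterpolator(truthfulness, freqs)

# Rescale so the continuous credence function integrates to one.
area = credence.integrate(0.0, 1.0)
print(float(credence(0.5) / area))                 # density at truthfulness 0.5
print(float(credence.integrate(0.0, 1.0) / area))  # total probability, ~1.0
```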

These curves are interesting because they show what the beliefs of an epistemic agent look like under the assumption that it is judging the candidates solely on what they have said, and NOTHING else. Of course real people are not EAs, and past experiences end up coloring the judgements we make concerning new experiences. Nonetheless, an EA’s credence function might be a good indicator of what someone who is not very informed about the history of the candidates ends up thinking about them after reading through all the statements analyzed by Politifact.

cdf.png
The cumulative distribution functions (cdf) derived from the credence functions for each candidate. The shaded region is to emphasize where the median lies for each of the candidates.
halfTrue
Median truthfulness of each candidate. Clearly the candidates break up into two groups based on this measure.

What value of truthfulness does an EA associate with half the things that a candidate says? The answer is found by looking at the cumulative distribution functions (above). CDFs tell us what fraction of the statements that come out of a candidate’s mouth fall below some value of truthfulness. We observe that the candidates cluster into groups: Clinton, Sanders, and Kasich lie in one, while Cruz and Trump occupy others. The truthfulness value for which the EA believes that half of a candidate’s statements are more truthful, and half are less truthful, is the median.

Thus far, it is safe to say that both Trump and Cruz do not inspire much faith in the epistemic agent. The other three candidates all seem to be on equal footing, which means we need a better way to distinguish what the EA actually believes about them.

A natural next step would be to look at the moments of the credence distributions; however, given the finite domain and the presence of significant tails, standard interpretations of deviation, skewness, kurtosis, etc. would fail. Fortunately, rather than applying these ad hoc statistical measures, we can get a good grasp of the shape of the distributions by examining their Shannon entropy:

S[p] = -\int_0^1 d\tau\, p(\tau)\log p(\tau)

Shannon entropy is an informational measure that quantifies the amount of uncertainty in a distribution. Credence distributions with low entropy contain a lot of information in them, meaning that an EA is less uncertain about the truthfulness of future statements. Our EA started off with beliefs that had maximal entropy: those based on ignorance. By interacting with the candidates, the EA has acquired information that has biased its beliefs so that it can make more informed judgements. Another way to say this is that the negentropy (the difference between maximal and actual entropy) is describing how an EA views the consistency of a candidate with regards to truthfulness.

There is a problem with this line of thinking, however: it only applies to median truthfulness values that are close to one half. The reason is that in order to get a high (or low) median truthfulness, the distribution must necessarily be skewed, and hence have a lower entropy. One can estimate the maximum possible entropy of a distribution with median truthfulness \tau by considering a piecewise constant distribution. The resulting minimal negentropy can then be written in a closed analytical form: -\log 2\sqrt{\tau(1-\tau)}. The difference between the actual negentropy and this minimal value then gives a good measure of the consistency of a candidate in the eyes of the EA.
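
A sketch of how the median, the negentropy, and the minimal-negentropy correction might be computed on a gridded credence function; the density used here is an arbitrary skewed example, not one of the candidates’:

```python
import numpy as np

def consistency(density, grid):
    """Median truthfulness and the admissible negentropy (negentropy minus the
    minimal negentropy -log(2*sqrt(tau*(1-tau))) discussed above)."""
    d = grid[1] - grid[0]
    p = density / (density.sum() * d)                # normalize the density
    entropy = -np.sum(p * np.log(np.clip(p, 1e-12, None))) * d
    negentropy = 0.0 - entropy                       # the uniform density on [0, 1] has S = 0
    cdf = np.cumsum(p) * d
    tau = grid[np.searchsorted(cdf, 0.5)]            # median truthfulness
    minimal = -np.log(2.0 * np.sqrt(tau * (1.0 - tau)))
    return tau, negentropy - minimal

# An arbitrary skewed credence density, peaked toward truthful statements.
grid = np.linspace(0.001, 0.999, 999)
print(consistency(grid**2 * (1.0 - grid), grid))
```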

We plot truthfulness (as measured by the median) versus consistency (as measured by the admissible negentropy), and see more of the differences between how an EA reasons about the candidates. Kasich falls out of the company of Clinton and Sanders; the EA thinks he’s nearly three times less consistent in his statements than Clinton. Cruz‘s consistency is on par with both Sanders and Clinton, while Trump beats everyone by a yuge margin in terms of how consistent he is about not being truthful. Sanders and Clinton appear grouped together, but the axes are a bit misleading: Sanders leads in truthfulness by about 3%, and in consistency by about 18%, over Clinton.

truthConsistency.png
Consistency, as measured by entropic deviation from maximum, versus Truthfulness, as measured by the median of the continuum credence functions. Clustering is evident.

Using information theoretic tools, one can examine how completely rational epistemic agents construct their beliefs out of experiences. In this toy study we saw that the technique can be used to analyze how journalistically derived data concerning the factuality of primary candidates’ statements during a campaign can shed light on the truthfulness and consistency of those candidates. It also shows us that running with simple ad hoc statistics (cough cough) can lead to results that are no longer valid in a more detailed analysis. Speaking of more detail… there are a million ways (information geometry of the candidate public-perception manifold, time-dependent Bayesian updating of EA states, testing the effects of nonuniform priors due to coupling between socioeconomic variables, etc.) to make this analysis more complete, but there’re only so many hours in a day. If you have any questions about my methods please feel free to leave a comment.

A Simple Economy


 

What is a fair economy?

I’ve been thinking about that a lot lately, and decided to model a simple economy to try to understand how economic transactions affect the distribution of wealth in a society. I’m not going to be addressing the big questions of policy here; I’ll simply be looking at how a mass of people exchange money, to see if there are equilibrium distributions and how fair those distributions are.

First off, let’s review a basic metric that tries to capture how much economic equality there is in a society: the Gini Index. Take a group of people and their individual wealth, and plot the cumulative percentage of wealth held versus the cumulative percentage of the population, ordered from poorest to richest. If the group has an equal distribution of wealth then 10% of the population will have 10% of the wealth, 60% of the population will have 60% of the wealth, etc. That means the plot will be a straight line, the Line of Equality:

LineofEquality

If we do this for a group of people in a state, or country, or continent, we get a curve that lies below the Line of Equality, the Lorenz Curve:

LCurve.jpg

The area between these two curves is the Gini Index divided by two, and gives us a measure of inequality. The more area there is, the more wealth is concentrated in a small fraction of the population; the less area there is, the more evenly distributed the wealth is. A Gini Index of 0 is perfect equality, whereas an index of 1 is the case where one person has the wealth of the entire population. The World Bank tries to calculate this quantity for every country, and UConn has a great site displaying the indices for all the counties in the US.
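
A small sketch of computing the Gini Index directly from a list of individual wealths, by building the Lorenz curve and measuring twice the area below the Line of Equality:

```python
import numpy as np

def gini(wealth):
    """Gini Index: twice the area between the Line of Equality and the Lorenz curve."""
    w = np.sort(np.asarray(wealth, dtype=float))
    lorenz = np.concatenate([[0.0], np.cumsum(w) / w.sum()])  # cumulative wealth share
    population = np.linspace(0.0, 1.0, len(lorenz))           # cumulative population share
    gap = population - lorenz
    area = np.sum((gap[:-1] + gap[1:]) / 2.0) * (population[1] - population[0])
    return 2.0 * area

print(gini([1, 1, 1, 1]))   # 0.0: perfect equality
print(gini([0, 0, 0, 10]))  # 0.75 here; approaches 1 as the population grows
```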

I wanted to see what happens to the Gini Index in a simple economy, which I decided to model. I took a population of N individuals (I did experiments with economies ranging from a thousand to a million individuals), and gave each individual the same amount of wealth. I then simulated the economy by having individuals randomly pair up with one another and, with some probability, have an economic transaction. The economic transaction consists of one of the individuals, a payer, giving one unit of wealth to the other individual, a payee (presumably in exchange for some activity). Debt is not allowed, so if the payer does not have one unit of wealth, then the transaction is cancelled. This model conserves wealth, in that the total wealth of all individuals does not change.
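
A minimal sketch of this transaction model (reusing the gini helper above); the population size, starting wealth, and number of steps are scaled down from the simulations described here:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n_people=1_000, start_wealth=10, steps=20_000, pairs_per_step=100):
    """Random pairwise transfers of one unit of wealth, with no debt allowed."""
    wealth = np.full(n_people, start_wealth)
    for _ in range(steps):
        payers = rng.integers(0, n_people, size=pairs_per_step)
        payees = rng.integers(0, n_people, size=pairs_per_step)
        for payer, payee in zip(payers, payees):
            if payer != payee and wealth[payer] > 0:  # cancel if the payer is broke
                wealth[payer] -= 1
                wealth[payee] += 1
    return wealth

print(gini(simulate()))  # drifts toward ~0.5 as the exponential distribution sets in
```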

Here’s how an economy of 100,000 individuals, where at every timestep there is a 4% chance of an economic interaction between at least one pair of individuals, evolves over time:

 

GiniEvolution

 

I let this economy run for years and years, and watched the distribution of wealth evolve. The Gini Index seems to stabilize around 0.5, and the wealth settles into an exponential distribution.

Results are consistent with a MaxEnt analysis. In MaxEnt we write down an entropy functional which captures the constraints we have for the system (total number of individuals, and total wealth). Maximizing this functional leads to the distribution that is the most ignorant it can be while still satisfying the constraints.

Consider M units of wealth in a population of N people, in a continuum limit. Denote the fraction of people that have wealth between m and m + dm by n(m)dm. The entropy functional is:

S[n] = -\int_0^\infty dm\, n(m)\log n(m) - \alpha\left(\int_0^\infty dm\, n(m) - 1\right) - \beta\left(\int_0^\infty dm\, m\, n(m) - \frac{M}{N}\right)

Since we are dealing with a continuous distribution, the first term is the Kullback–Leibler divergence with respect to a uniform distribution. The second term is the normalization constraint for the population, while the third is the constraint on the average wealth. \alpha and \beta are Lagrange multipliers that enforce the constraints. Moving on, the population fraction and wealth fraction are, respectively:

F(m) = \int_0^m dm'\, n(m'), \qquad W(m) = \frac{N}{M}\int_0^m dm'\, m'\, n(m')

Maximizing the entropy function with respect to the two Lagrange multipliers gives us the constraints, while maximizing with respect to the distribution gives us the exponential form found in the numerical experiment:

n(m) = \frac{N}{M}\,e^{-Nm/M}

where we have used the constraints to solve for the Lagrange multipliers. Plugging these back into the equations for the population and wealth fractions, and eliminating the wealth variable, m, one can flex their algebraic muscles to show that:

W = F + (1-F)\log(1-F)

This is the functional form of the Lorenz curve, and can be used to extract the Gini Index:

G = 2\int_0^1 dF\,\left[F - W(F)\right] = -2\int_0^1 dF\,(1-F)\log(1-F) = \frac{1}{2}

So the equilibrium distribution in the continuum limit (infinite population and infinite wealth) is an exponential, and its Gini Index is .5.

This is an interesting result since it is halfway between an equal distribution of wealth, and the most unequal distribution of wealth. Given the symmetric nature of the economic transactions (no benefit is gained by having more or less money), this economy seems pretty fair, yet the resulting wealth distribution is quite unequal. So what’s going on here?

First of all, it must be stressed that individuals are performing a random walk, so the distribution represents the amount of time that a particular individual will spend (on average) with a particular value of wealth. This is made quite evident if we follow several individuals’ wealth in a simulation:

 

WalletEvolution

 

Each individual’s wealth jumps around quite a bit. The Gini Index is not capturing this particular aspect of wealth, and the question of whether we can construct a better metric for measuring wealth disparity is one I may get back to in the future. It should also be noted that this model is equivalent to that of a gas consisting of atoms that have kinetic energy and are allowed to exchange energy in completely elastic collisions. The economic agents are the analog of the atoms, and wealth is the analog of kinetic energy. The exponential distribution seen here is analogous to the Boltzmann distribution, with the average wealth playing the role of temperature.

A major drawback of this model is that we haven’t incorporated any interesting dynamics such as debt, inflation, asymmetric transactions, or more complex financial entities. Some of these have been explored in papers by Yakovenko and Caticha, to name a few. Since economics is not a zero-sum game, and wealth can be generated by both parties during a transaction, it would be interesting to see how any of these types of dynamics end up affecting the equilibrium wealth distribution and consequently the Gini Index.

Hopefully in a future post I’ll explore this model a little more, and incorporate some of the aforementioned aspects of more realistic economies. We’ll see how much time I can actually squeeze into this econophysics side project… till next time!

Modeling Predator-Prey Spatial Dynamics


The dynamics of two populations, one predator the other prey, are modeled using the Lotka-Volterra equations. In this post I’ll discuss why these equations have the form they do in a fully mixed model, and look at generalizations to incorporate spatial dynamics.

Let’s start with basic growth. If you have a large population then it will have more children than a small population. The rate of growth of the population is proportional to its size. Denoting the number of individuals in a population i at time t as N_i (t) we can write

\frac{dN_i}{dt} = k_i(\{N_j\})\, N_i

Here the growth rates are functions of the other populations. We want this because the size of a predator population should affect the death rate of a prey population, and the size of the prey population should affect the birth rate of a predator population. For a two population model we have then:

\frac{dN_1}{dt} = \left(\alpha_1 + \beta_1 N_2\right)N_1

\frac{dN_2}{dt} = \left(\alpha_2 + \beta_2 N_1\right)N_2

For a prey population the alpha is positive (base birth rate higher than death rate) and the beta is negative (death rate increased by presence of predator), and for a predator population we have the reverse (base death rate higher than birth rate, birth rate increased by presence of prey). This basic system has oscillatory dynamics: the predator population stays low until there is a surge in prey population. The surge in the predator population forces down the prey population, which in turn leads to a collapse of the predator population allowing the prey to grow again.
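
A sketch of integrating this two-population system with SciPy; the rates and initial populations are arbitrary choices that produce the familiar oscillations:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Signed rates: prey (index 0) has alpha > 0 and beta < 0; the predator (index 1) the reverse.
alpha = np.array([1.0, -0.5])
beta = np.array([-0.02, 0.01])

def lotka_volterra(t, N):
    prey, predator = N
    return [(alpha[0] + beta[0] * predator) * prey,
            (alpha[1] + beta[1] * prey) * predator]

solution = solve_ivp(lotka_volterra, (0.0, 50.0), [40.0, 9.0], max_step=0.01)
print(solution.y[:, -1])  # the populations keep oscillating rather than settling down
```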

Now how do we add spatial dynamics?

First we allow the population functions to be functions of space, and define population densities so that the number of individuals in a region \Delta \vec r around some point \vec r is N_i(\vec r, t|\Delta \vec r)=n_i(\vec r,t)\Delta\vec r. What kind of dynamics do we wish to incorporate? How about the exploration of the environment, as well as the propensity for predators to chase prey and prey to run away from predators?

https://i0.wp.com/mathurl.com/htetwvt.png

https://i0.wp.com/mathurl.com/z3wc25d.png

With no environmental pressure we might think of the individuals in a population as performing random walks. High population densities will diffuse to regions of lower density, and in an idealized limit there would be individuals everywhere in the environment. Overpopulated lumps dissipate and underpopulated dips fill up. By noting that lumps have negative curvature while dips have positive curvature, this dynamic can be implemented with a Laplacian:

\frac{\partial n_i}{\partial t} = \mu_i\nabla^2 n_i + \cdots

If this were the only term on the RHS, then we would have the diffusion equation, where the motility, \mu, of the species is playing the role of the diffusivity. It is a positive number, and the larger it is, the faster the individuals in the species can move around. Clumping behavior could be modeled by allowing the motility to become negative.

Alright, so what about the coupled spatial dynamics of the two populations: chasing and escaping? What kind of term can model this? Consider the density of both populations at a single point, and let’s think about how one of the populations reacts to the other. If the population is predatory then it will want to move in the direction of increasing prey population. Prey will act in the opposite fashion. In an interval of time the individuals at a point will shift to a new point that is determined by the gradient of the other population, hence:

https://i1.wp.com/mathurl.com/h6zejvc.png

Chasing and escaping behavior can be modeled with gradient couplings! Here \nu is the coupling motility, and is positive for chasers (predators) and negative for escapees (prey).  Our spatial dynamics are captured in the operator:

https://i2.wp.com/mathurl.com/jdut7b8.png

In the case of two populations we now have:

https://i0.wp.com/mathurl.com/grsy9fl.png

These equations are a great starting point for looking at the spatial dynamics of populations, and it is pretty obvious how one can go about generalizing them to include more complicated behavior.
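
To give a flavor of what a numerical implementation might look like, here is a rough one-dimensional finite-difference sketch with only the transport terms (diffusion plus a gradient-coupling "chasing" term); the exact form of the coupling term and every parameter value here are assumptions for illustration, not the model used for the simulation below:

```python
import numpy as np

dx, dt = 1.0, 0.01
x = np.arange(0.0, 100.0, dx)
prey = 1.0 + 0.1 * np.exp(-(x - 30.0) ** 2 / 20.0)  # a bump of prey
pred = 1.0 + 0.1 * np.exp(-(x - 60.0) ** 2 / 20.0)  # a bump of predators nearby

def laplacian(n):
    return (np.roll(n, 1) - 2.0 * n + np.roll(n, -1)) / dx**2

def grad(n):
    return (np.roll(n, -1) - np.roll(n, 1)) / (2.0 * dx)

mu_prey, mu_pred = 0.5, 0.5   # motilities (diffusivities)
nu_prey, nu_pred = -0.2, 0.2  # coupling motilities: negative for the escaper, positive for the chaser

for _ in range(1000):
    d_prey = mu_prey * laplacian(prey) - grad(nu_prey * prey * grad(pred))
    d_pred = mu_pred * laplacian(pred) - grad(nu_pred * pred * grad(prey))
    prey, pred = prey + dt * d_prey, pred + dt * d_pred

print(prey.max(), pred.max())  # both bumps spread out; the predator bump drifts toward the prey
```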

Just to give you a sense of how the spatial predator and prey dynamics evolve over time, I coded up a slightly more complicated model than the one just written (it includes a predator-predator fighting component):

Also, make sure to check out the artwork of Alex Solis if you dig adorable versions of predators and prey.