
Nathaniel Thomas

Entropy from First Principles

January 1, 2025

I find entropy to be extremely fascinating. But matching the formula $\sum_i p_i \log \frac{1}{p_i}$ to its “intuitive” explanations related to prefix-free codes and information content is not obvious. Here, I want to go over a couple of ways to independently arrive at the idea.

Properties of Information

Suppose we want to define a function $I$ that represents the information content of an event. Abstracting away the specifics of the event, one measure we can use to compare one event to another is its probability of occurrence. So $I$ could be a mapping from a probability $p \in [0,1]$ to $\mathbb{R}$. Given this framing, the following requirements are sensible:

  1. $I(1) = 0$. If an event definitely occurs, it is not very interesting and gives us no information.

  2. $I$ should be continuous and monotonically decreasing on $(0,1]$. A more common event is less informative.

  3. Two independent events with probabilities $p$ and $q$ should have combined information $I(p) + I(q)$.

The last requirement is the most telling. By definition, the probability of two independent events both occurring is $pq$. So

$$I(pq) = I(p) + I(q).$$

Since the function must be continuous, the only solutions to this functional equation are of the form

$$I(p) = c \log p.$$
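One way to see this: write $p = 2^{-x}$ and $q = 2^{-y}$ with $x, y \ge 0$, and define $g(x) = I(2^{-x})$. Then $I(pq) = I(p) + I(q)$ becomes Cauchy's functional equation,

$$g(x + y) = g(x) + g(y),$$

whose only continuous solutions are linear, $g(x) = kx$. Translating back gives $I(p) = -k \log_2 p$, which is $c \log p$ with $c = -k$.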

If we want $I$ to be monotonically decreasing,

$$\frac{dI}{dp} = \frac{c}{p}$$

must be negative. Since $p$ is positive, $c$ must be negative. Letting $c' = -c$,

$$I(p) = c' \log \frac{1}{p}.$$

Since $\log_a b = \frac{\log b}{\log a}$, where the denominator is a constant, we can think of $c'$ as encoding the base of the logarithm. For convenience, we let $c' = 1$ and let $\log$ denote the base-2 logarithm.
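As a quick sanity check, here is the information content of a few events in bits (a minimal sketch; the name surprisal is just for illustration):

import math

def surprisal(p: float) -> float:
    """Information content I(p) = log2(1/p), in bits."""
    return math.log2(1 / p)

print(surprisal(0.5))    # fair coin flip: 1.0 bit
print(surprisal(1 / 6))  # one face of a fair die: ~2.585 bits
print(surprisal(1.0))    # certain event: 0.0 bits, matching requirement 1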

Entropy is simply the expected value of $I$ over a distribution $p = (p_1, p_2, \ldots, p_n)$:

$$H(p) = \sum_{i=1}^{n} p_i \log \frac{1}{p_i}.$$

We also adopt the convention $0 \log \frac{1}{0} = 0$, motivated by continuity, since $p \log \frac{1}{p} \to 0$ as $p \to 0$.
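In code, the same formula looks like this (a minimal sketch, with the $0 \log \frac{1}{0} = 0$ convention handled by skipping zero-probability terms):

import math

def entropy(probs):
    """Shannon entropy H(p) = sum_i p_i * log2(1/p_i), in bits."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit: a fair coin
print(entropy([0.25] * 4))   # 2.0 bits: uniform over four symbols
print(entropy([1.0, 0.0]))   # 0.0 bits: a deterministic outcome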

For example, consider the Bernoulli random variable $B_p$, which takes the value 1 with probability $p$ and 0 with probability $1-p$. If we plot its entropy

[Figure: entropy of a Bernoulli random variable as a function of p]

Plotting code:
import numpy as np
import plotly.graph_objects as go

# Function to calculate the entropy of a Bernoulli variable
def bernoulli_entropy(p):
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

# Generate values for p strictly between 0 and 1 (log2 is undefined at the endpoints)
p_values = np.linspace(0.01, 0.99, 100)
entropy_values = bernoulli_entropy(p_values)

# Create the plot
fig = go.Figure()

# Add the entropy trace
fig.add_trace(go.Scatter(x=p_values, y=entropy_values, mode='lines', name='Entropy', line=dict(color='red')))

# Update layout for dark mode
fig.update_layout(
    title='Entropy of a Bernoulli Random Variable',
    xaxis_title='p',
    yaxis_title='Entropy',
    template='plotly_dark'
)

# Save the plot to an HTML file
fig.write_html("bernoulli_entropy_plot.html")

we see that the entropy is maximized when the distribution is uniform ($p = 1/2$) and approaches zero as the outcome becomes nearly deterministic ($p \to 0$ or $p \to 1$).

Prefix-free codes

Suppose we have a set of symbols $\mathcal{X} = \{X_1, \ldots, X_M\}$ that we want to transmit over a binary channel, which carries one bit (a 0 or a 1) at a time. We want to find an optimal encoding scheme for $\mathcal{X}$, with one requirement: it must be prefix-free.

Let’s define an encoding function $f : \mathcal{X} \to \{0,1\}^+$, which maps symbols to binary strings of length $\ge 1$. We say an encoding is prefix-free if no codeword is a prefix of another. For example, $\{0, 01\}$ is not prefix-free because 0 is a prefix of 01. However, $\{0, 10\}$ is.

A prefix-free code is uniquely decodable without additional delimiters between symbols, which is a desirable property.
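As a small illustration, here is a sketch that checks the prefix-free property and decodes a concatenated bit string without any delimiters (the code {"a": "0", "b": "10", "c": "11"} is just an example):

def is_prefix_free(codewords):
    """True if no codeword is a prefix of another."""
    codewords = list(codewords)
    return not any(
        u != v and v.startswith(u) for u in codewords for v in codewords
    )

def decode(bits, code):
    """Greedily decode a bit string; unambiguous because the code is prefix-free."""
    inverse = {cw: sym for sym, cw in code.items()}
    symbols, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in inverse:
            symbols.append(inverse[buffer])
            buffer = ""
    return "".join(symbols)

code = {"a": "0", "b": "10", "c": "11"}
print(is_prefix_free(code.values()))  # True
print(is_prefix_free(["0", "01"]))    # False: 0 is a prefix of 01
print(decode("01011", code))          # "abc", recovered without delimiters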

We also notice that a binary prefix code is uniquely defined by a binary tree:

[Figure: binary tree of a prefix code. Source: https://leimao.github.io/blog/Huffman-Coding]

where the root-to-symbol path determines the codeword, and symbols are always leaves. Convince yourself that any construction like this results in a prefix code.
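To make the correspondence concrete, here is a sketch that reads codewords off a small binary tree, represented as nested pairs (an arbitrary representation chosen for illustration):

def codewords(tree, prefix=""):
    """Map each leaf symbol to its root-to-leaf path (0 = left branch, 1 = right branch)."""
    if isinstance(tree, str):  # leaves hold symbols
        return {tree: prefix}
    left, right = tree
    return {**codewords(left, prefix + "0"), **codewords(right, prefix + "1")}

tree = ("a", ("b", ("c", "d")))
print(codewords(tree))  # {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
# Since symbols only sit at leaves, no path continues through another symbol,
# so no codeword is a prefix of another.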

We will now show that the expected codeword length $L$ of any prefix code over $\mathcal{X}$ satisfies $H(X) \le L$, and that there always exists a prefix code with $L < H(X) + 1$. For an optimal code, then,

$$H(X) \le L < H(X) + 1,$$

where $X$ is a random variable that takes values in the set $\mathcal{X}$ with probabilities $(p_1, \ldots, p_M)$. Most importantly, we see that the entropy of $X$ is a lower bound on how much a distribution can be compressed, or equivalently, on how much information it contains.

Kraft’s Inequality

[Figure: Kraft’s Inequality example tree]

Suppose that $l_i$ is the length of the $i$th codeword. If the code is prefix-free, then

$$\sum_{i=1}^{M} 2^{-l_i} \le 1.$$

Proof:

Let $l_{\max}$ be the length of the longest codeword, and consider the code’s binary tree extended down to level $l_{\max}$. We notice that:

  1. There are at most $2^{l_{\max}}$ nodes at level $l_{\max}$.
  2. For any codeword of length $l_i$, there are $2^{l_{\max} - l_i}$ descendants at level $l_{\max}$.
  3. The sets of descendants of each codeword are disjoint (since one codeword is never a descendant of another).

These imply

$$\sum_i 2^{l_{\max} - l_i} \le 2^{l_{\max}} \;\implies\; \sum_i 2^{-l_i} \le 1.$$

Why $\le$ instead of equality? Because it is possible that a node at level $l_{\max}$ is not a descendant of any codeword (consider the tree of the code $\{10, 11\}$)!
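A quick numeric check (a sketch; the first set of lengths comes from the tree sketch above, the second from the {10, 11} example just mentioned):

def kraft_sum(lengths):
    """Left-hand side of Kraft's inequality: sum_i 2^(-l_i)."""
    return sum(2 ** -l for l in lengths)

print(kraft_sum([1, 2, 3, 3]))  # code {0, 10, 110, 111}: 0.5 + 0.25 + 0.125 + 0.125 = 1.0
print(kraft_sum([2, 2]))        # code {10, 11}: 0.5, the strict-inequality case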

Lower bound for L

Now, let’s consider the expected codeword length

$$L = \sum_i p_i l_i.$$

We will show that entropy is a lower bound for $L$, or

$$H(X) \le L \iff L - H(X) \ge 0.$$

Proof:

$$
\begin{aligned}
L - H(X) &= \sum_i p_i l_i + \sum_i p_i \log p_i \\
&= -\sum_i p_i \log 2^{-l_i} + \sum_i p_i \log p_i \\
&= -\sum_i p_i \log 2^{-l_i} + \sum_i p_i \log p_i + \sum_i p_i \log\Big(\sum_j 2^{-l_j}\Big) - \sum_i p_i \log\Big(\sum_j 2^{-l_j}\Big) \\
&= \sum_i p_i \log \frac{p_i}{2^{-l_i} / \sum_j 2^{-l_j}} - \sum_i p_i \log\Big(\sum_j 2^{-l_j}\Big) \\
&= D_{\mathrm{KL}}[\,p \,\|\, q\,] + \log \frac{1}{c} \ge 0
\end{aligned}
$$

where $q_i = 2^{-l_i} / c$ and $c = \sum_j 2^{-l_j}$. The final inequality holds because 1) KL divergence is non-negative and 2) $c \le 1$ by Kraft’s inequality, so $\log \frac{1}{c} \ge 0$. One thing to note is that if $l_i = -\log p_i$ for every $i$, then $L = H(X)$, the theoretical minimum. The reason we cannot always achieve this is that $-\log p_i$ need not be an integer, which codeword lengths obviously must be.
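The decomposition can be checked numerically. A minimal sketch, using an arbitrary illustrative distribution and the lengths of the code {0, 10, 110, 111}:

import math

p = [0.5, 0.25, 0.15, 0.10]  # an arbitrary distribution, for illustration
lengths = [1, 2, 3, 3]       # prefix code {0, 10, 110, 111}

L = sum(pi * li for pi, li in zip(p, lengths))
H = sum(pi * math.log2(1 / pi) for pi in p)

c = sum(2 ** -l for l in lengths)   # Kraft sum
q = [2 ** -l / c for l in lengths]  # the implied distribution q_i = 2^(-l_i) / c
kl = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))

print(L - H)                  # ~0.0073
print(kl + math.log2(1 / c))  # the same value: L - H = D_KL(p || q) + log(1/c) >= 0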

An Upper Bound for L

Notice that it is possible to construct a prefix code with lengths

$$l_i = \lceil -\log p_i \rceil,$$

since these lengths satisfy Kraft’s Inequality:

$$\sum_i 2^{-l_i} \le \sum_i 2^{\log p_i} = \sum_i p_i = 1.$$

By the definition of the ceiling function,

$$-\log p_i \le l_i < -\log p_i + 1.$$

Taking the expectation over $p$, we get

$$-\sum_i p_i \log p_i \le \sum_i p_i l_i < -\sum_i p_i \log p_i + 1 \;\implies\; H(X) \le L < H(X) + 1.$$

This shows that $H(X) + 1$ is a good upper bound for $L$!
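Putting the construction into code (a minimal sketch, reusing the illustrative distribution from before; this length assignment is sometimes called a Shannon code):

import math

p = [0.5, 0.25, 0.15, 0.10]
lengths = [math.ceil(-math.log2(pi)) for pi in p]  # l_i = ceil(-log p_i)

kraft = sum(2 ** -l for l in lengths)
L = sum(pi * li for pi, li in zip(p, lengths))
H = sum(pi * math.log2(1 / pi) for pi in p)

print(lengths)         # [1, 2, 3, 4]
print(kraft <= 1)      # True, so a prefix code with these lengths exists
print(H <= L < H + 1)  # True: H(X) <= L < H(X) + 1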

In summary, entropy is a lower bound on, and a reasonable estimate of, the average number of bits it takes to encode a symbol drawn from a distribution with a prefix code.

References

  • Alon Orlitsky’s ECE 255A Lectures, UCSD
  • Cover, T. M., & Thomas, J. A. (2006). Elements of Information Theory. Wiley-Interscience.
