The Kelly Criterion from Shannon Information Theory
In 1956, John Kelly Jr. at Bell Labs published a remarkable paper connecting two apparently unrelated problems: the reliable transmission of information over noisy channels, and the optimal sizing of repeated bets. His central result was that a gambler with access to a noisy private wire to a racetrack can grow capital at an exponential rate equal to the Shannon mutual information between the source and the received signal. The optimal strategy, now called the Kelly criterion, prescribes exact fractions of wealth to wager on each outcome.
This post derives Kelly's result from first principles. We build the information-theoretic foundations (§1), formulate the gambler's optimization problem (§2), solve it via Karush-Kuhn-Tucker conditions (§3), prove the connection to mutual information (§4), specialize to binary prediction markets (§5), and discuss application to algorithmic trading (§6).
Conventions. All logarithms are base 2 unless otherwise noted, so that information is measured in bits. We write conditional probabilities as $p(x \mid y)$, $q(y \mid x)$, and use the geometric language of the probability simplex throughout. Kelly's 1956 paper uses a slash in place of the conditioning bar (e.g., $p(x/y)$ for $p(x \mid y)$); an appendix restates the key results in his original notation for readers following the paper directly.
1. Shannon's Communication Theory
Shannon's 1948 theory models communication as a probabilistic process. A source emits symbols from a fixed, finite set of possibilities: letters in a language, bits in a wire protocol, horses in a race. The restriction to finitely many symbols is not a limitation for our purposes. Every prediction market resolves to one of finitely many outcomes (typically just YES or NO), and every horse race has finitely many horses. The mathematical structure we need begins with this finite set and the space of probability distributions over it.
1.1. The probability simplex and entropy
Let $\mathcal{X}$ be a finite set of $n$ symbols (the alphabet). The space of all probability distributions on $\mathcal{X}$ is the probability simplex
$$\Delta^{n-1} = \Big\{ p \in \mathbb{R}^n : p_i \ge 0, \ \sum_{i=1}^{n} p_i = 1 \Big\},$$
an $(n-1)$-dimensional manifold with boundary, embedded in $\mathbb{R}^n$. Each point $p \in \Delta^{n-1}$ assigns probability $p_i$ to symbol $i$.
The entropy of a distribution $p \in \Delta^{n-1}$ is
$$H(p) = -\sum_{i=1}^{n} p_i \log p_i,$$
with the convention $0 \log 0 = 0$ (justified by continuity, since $t \log t \to 0$ as $t \to 0^+$).
Entropy quantifies uncertainty: $H(p) = 0$ when $p$ is a point mass (the outcome is deterministic), and $H(p)$ is maximized when $p$ is uniform. We prove this bound precisely as a corollary of the next result. For the binary case $n = 2$, the entropy $H(p) = -p \log p - (1 - p) \log(1 - p)$ is a function of a single parameter $p \in [0, 1]$, peaking at $1$ bit when $p = 1/2$:
1.2. Kullback-Leibler divergence
For distributions $p, q \in \Delta^{n-1}$ with $p$ absolutely continuous with respect to $q$ (i.e., $q_i = 0$ implies $p_i = 0$), the KL divergence from $q$ to $p$ is
$$D(p \,\|\, q) = \sum_{i} p_i \log \frac{p_i}{q_i}.$$
Geometrically, $D(\cdot \,\|\, \cdot)$ is a Bregman divergence on $\Delta^{n-1}$ generated by the negative entropy functional $p \mapsto \sum_i p_i \log p_i$. It is not symmetric and does not satisfy the triangle inequality, so it is not a metric. It does, however, define a canonical notion of directed distance on the simplex. Its Hessian at a point $p$ recovers the Fisher information metric $g_{ij}(p) = \delta_{ij}/p_i$, the unique (up to scale) Riemannian metric on the interior of $\Delta^{n-1}$ invariant under sufficient statistics [4].
Gibbs' inequality. For any distributions $p, q \in \Delta^{n-1}$,
$$D(p \,\|\, q) \ge 0,$$
with equality if and only if $p = q$.
Proof. Since $t \mapsto -\log t$ is strictly convex, Jensen's inequality gives
$$D(p \,\|\, q) = \sum_i p_i \left( -\log \frac{q_i}{p_i} \right) \ge -\log \sum_i p_i \, \frac{q_i}{p_i} = -\log \sum_i q_i = -\log 1 = 0.$$
Equality in Jensen's inequality holds if and only if $q_i / p_i$ is constant $p$-almost surely. Since both $p$ and $q$ sum to 1, the constant must be 1, forcing $p = q$. ∎
Entropy bounds. For any $p \in \Delta^{n-1}$,
$$0 \le H(p) \le \log n.$$
The lower bound is achieved if and only if $p$ is a point mass. The upper bound is achieved if and only if $p$ is the uniform distribution $u = (1/n, \dots, 1/n)$.
Proof. For the lower bound: each term $-p_i \log p_i \ge 0$ since $0 \le p_i \le 1$, with equality for all $i$ if and only if $p$ is a point mass.
For the upper bound: let $u = (1/n, \dots, 1/n)$. By Gibbs' inequality,
$$0 \le D(p \,\|\, u) = \sum_i p_i \log \frac{p_i}{1/n} = \log n - H(p),$$
so $H(p) \le \log n$, with equality if and only if $p = u$. ∎
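Both quantities are short functions in code. Below is a minimal numerical sketch (Python with NumPy; the example distribution is an arbitrary choice) that evaluates entropy and KL divergence in bits and checks the identity $D(p \,\|\, u) = \log n - H(p)$ used in the proof above.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits, with the convention 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

def kl(p, q):
    """KL divergence D(p || q) in bits; assumes q_i > 0 wherever p_i > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return np.sum(p[nz] * np.log2(p[nz] / q[nz]))

p = np.array([0.7, 0.2, 0.1])        # arbitrary example distribution
u = np.ones(3) / 3                   # uniform distribution on 3 symbols
print(entropy(p))                    # ≈ 1.157 bits, below log2(3) ≈ 1.585
print(kl(p, u))                      # ≥ 0, by Gibbs' inequality
print(np.log2(3) - entropy(p))       # equals D(p || u), as in the proof
```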
1.3. Communication channels
A discrete memoryless channel consists of an input alphabet $\mathcal{X}$, an output alphabet $\mathcal{Y}$, and a stochastic map defined by the transition probabilities
$$q(y \mid x) = \Pr[Y = y \mid X = x], \qquad \sum_{y \in \mathcal{Y}} q(y \mid x) = 1 \ \text{for each } x \in \mathcal{X}.$$
Given a source distribution $p(x)$, the joint distribution, output marginal, and posterior are:
$$p(x, y) = p(x)\, q(y \mid x), \qquad q(y) = \sum_{x} p(x)\, q(y \mid x), \qquad p(x \mid y) = \frac{p(x)\, q(y \mid x)}{q(y)}.$$
In geometric language, the channel induces a pushforward $p \mapsto q$ on the output simplex $\Delta^{|\mathcal{Y}| - 1}$, and the Bayesian posterior defines a map $\mathcal{Y} \to \Delta^{|\mathcal{X}| - 1}$ by $y \mapsto p(\cdot \mid y)$. The posterior map sends each received symbol to a point on the source simplex, encoding what the receiver has learned about $X$ after observing $Y = y$.
1.4. Conditional entropy and mutual information
The conditional entropy of $X$ given $Y$ is
$$H(X \mid Y) = \sum_{y} q(y)\, H\big(p(\cdot \mid y)\big) = -\sum_{x, y} p(x, y) \log p(x \mid y),$$
where $H(p(\cdot \mid y))$ is the entropy of the posterior distribution $p(\cdot \mid y)$.
The mutual information between $X$ and $Y$ is
$$I(X; Y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\, q(y)},$$
and it satisfies the entropy decomposition $I(X; Y) = H(X) - H(X \mid Y)$: the information $Y$ carries about $X$ is the reduction in uncertainty about $X$ from observing $Y$.
Proof. Starting from the definition of $I(X; Y)$ and using $p(x, y) = p(x \mid y)\, q(y)$:
$$I(X; Y) = \sum_{x, y} p(x, y) \log \frac{p(x \mid y)}{p(x)} = \sum_{x, y} p(x, y) \log p(x \mid y) - \sum_{x, y} p(x, y) \log p(x).$$
The first sum is $-H(X \mid Y)$ by the definition of conditional entropy. For the second sum, marginalize over $y$:
$$\sum_{x, y} p(x, y) \log p(x) = \sum_{x} p(x) \log p(x) = -H(X).$$
Combining: $I(X; Y) = H(X) - H(X \mid Y)$. ∎
The mutual information is also the expected divergence from prior to posterior:
$$I(X; Y) = \sum_{y} q(y)\, D\big(p(\cdot \mid y) \,\|\, p\big).$$
Proof. From the proof of the entropy decomposition:
$$I(X; Y) = \sum_{x, y} p(x, y) \log \frac{p(x \mid y)}{p(x)} = \sum_{y} q(y) \sum_{x} p(x \mid y) \log \frac{p(x \mid y)}{p(x)} = \sum_{y} q(y)\, D\big(p(\cdot \mid y) \,\|\, p\big),$$
where we used $p(x, y) = q(y)\, p(x \mid y)$ to factor the outer sum. ∎
Nonnegativity: $I(X; Y) \ge 0$, with equality if and only if $X$ and $Y$ are independent.
Proof. By the divergence decomposition above, $I(X; Y)$ is a convex combination of KL divergences. Each term satisfies $D\big(p(\cdot \mid y) \,\|\, p\big) \ge 0$ by Gibbs' inequality, so $I(X; Y) \ge 0$. Equality holds if and only if $D\big(p(\cdot \mid y) \,\|\, p\big) = 0$ for all $y$ with $q(y) > 0$, which by Gibbs' inequality requires $p(\cdot \mid y) = p$ for all such $y$. This is precisely the condition that $X$ and $Y$ are independent. ∎
On the simplex $\Delta^{|\mathcal{X}| - 1}$, the posterior map traces out a cloud of points $\{p(\cdot \mid y)\}_{y \in \mathcal{Y}}$ around the prior $p$. The mutual information is the average KL divergence from $p$ to these posterior points, weighted by the output marginal $q(y)$. A noiseless channel sends each posterior to a vertex of the simplex (a point mass), maximizing the average divergence at $H(X)$. A completely noisy channel satisfies $p(\cdot \mid y) = p$ for all $y$, collapsing the divergence to zero. The mutual information thus measures how far, on average, the posterior moves from the prior on the simplex.
The visualization below makes this concrete for a binary channel. The prior (gold circle) sits on the simplex $\Delta^1$. As the channel noise decreases, the two posteriors (blue dots, one per received symbol) spread further from the prior, and the mutual information grows.
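The posterior cloud is easy to compute directly. The sketch below builds a binary symmetric channel (the prior and crossover probability are illustrative, not values from the text) and evaluates the mutual information both ways: as $H(X) - H(X \mid Y)$ and as the output-weighted average KL divergence from prior to posterior. The two agree, as proved above.

```python
import numpy as np

def entropy(p):
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

def kl(p, q):
    nz = p > 0
    return np.sum(p[nz] * np.log2(p[nz] / q[nz]))

prior = np.array([0.6, 0.4])              # p(x), illustrative
eps = 0.1                                 # crossover probability, illustrative
Q = np.array([[1 - eps, eps],             # Q[x, y] = q(y | x)
              [eps, 1 - eps]])

joint = prior[:, None] * Q                # p(x, y) = p(x) q(y | x)
q_y = joint.sum(axis=0)                   # output marginal q(y)
post = joint / q_y                        # p(x | y), one column per y

I_entropy = entropy(prior) - sum(q_y[y] * entropy(post[:, y]) for y in range(2))
I_kl = sum(q_y[y] * kl(post[:, y], prior) for y in range(2))
print(I_entropy, I_kl)                    # identical: two faces of I(X;Y)
```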
2. The Gambler's Problem
Consider a sequence of independent horse races, each with $m$ horses. In each race, horse $k$ wins with probability $p(k)$. The track pays odds of $o_k$-for-1 on horse $k$: a unit bet returns $o_k$ if horse $k$ wins and zero otherwise.
Before each race, the gambler receives a symbol $y$ through a noisy channel with transition probabilities $q(y \mid k)$. After observing $y$, the gambler allocates a fraction $b_k(y)$ of current capital to each horse $k$, retaining a holdback
$$b_0(y) = 1 - \sum_{k=1}^{m} b_k(y) \ge 0.$$
If horse $k$ wins, the gambler's wealth is multiplied by $b_0(y) + o_k\, b_k(y)$.
The exponential growth rate of the gambler's capital is the expected log-return per race:
$$W = \mathbb{E}\big[\log\big(b_0(Y) + o_X\, b_X(Y)\big)\big] = \sum_{k, y} p(k)\, q(y \mid k) \log\big(b_0(y) + o_k\, b_k(y)\big).$$
By the strong law of large numbers applied to the i.i.d. sequence of log-returns, $\frac{1}{N} \log \frac{V_N}{V_0} \to W$ almost surely as $N \to \infty$, where $V_N$ is the gambler's wealth after $N$ races. The wealth therefore satisfies $V_N \approx V_0\, 2^{N W}$, so maximizing $W$ maximizes the asymptotic exponential growth rate. The Kelly criterion is the strategy that maximizes $W$ subject to the budget constraint above for each $y$.
For a binary bet ($m = 2$, market price $c$), the growth rate $G(f)$ as a function of the bet fraction $f$ is strictly concave, guaranteeing a unique optimum. The gold dot sweeps across all possible bet sizes. Betting too little wastes edge; betting too much inverts the growth rate, leading to ruin. The shaded region marks the zone where $G(f) < 0$.
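The concave shape is easy to reproduce numerically. A minimal sketch, assuming an illustrative edge (believed probability $p = 0.6$ against price $c = 0.5$): the grid maximizer of $G(f)$ lands at the Kelly fraction $(p - c)/(1 - c) = 0.2$, and $G$ turns negative once the bet is large enough.

```python
import numpy as np

p, c = 0.6, 0.5          # believed probability and market price (illustrative)

def G(f):
    """Growth rate in bits per bet when betting fraction f on YES at price c."""
    return p * np.log2(1 - f + f / c) + (1 - p) * np.log2(1 - f)

f_grid = np.linspace(0.0, 0.95, 2000)
best = f_grid[np.argmax(G(f_grid))]
print(best)              # ≈ 0.2 = (p - c) / (1 - c)
print(G(best))           # ≈ 0.029 bits per bet at the optimum
print(G(0.6))            # overbetting: the growth rate goes negative
```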
3. Optimal Allocation
3.1. Decomposition per received symbol
The growth rate decomposes as $W = \sum_y q(y)\, W_y$, where
$$W_y = \sum_{k=1}^{m} p(k \mid y) \log\big(b_0(y) + o_k\, b_k(y)\big).$$
Since $q(y) \ge 0$ and the variables $b_k(y)$ for distinct values of $y$ are disjoint, maximizing $W$ is equivalent to independently maximizing each $W_y$.
Proof. Write $p(k)\, q(y \mid k) = q(y)\, p(k \mid y)$ and factor the outer sum over $y$. The budget constraint couples only variables sharing the same $y$, so the per-$y$ subproblems are independent. Since each $q(y) \ge 0$, the global maximum of $W$ is achieved by maximizing each $W_y$ separately. ∎
3.2. The Kelly theorem
By the decomposition lemma, fix a received symbol $y$ and write $p_k = p(k \mid y)$, $a_k = b_k(y)$, $b = b_0(y)$ for brevity. The subproblem is:
$$\max \ W_y = \sum_{k=1}^{m} p_k \log(b + o_k a_k) \quad \text{subject to} \quad a_k \ge 0, \quad b = 1 - \sum_{k=1}^{m} a_k \ge 0.$$
Substituting $b = 1 - \sum_j a_j$ and differentiating with respect to $a_k$, the chain rule gives
$$\frac{\partial}{\partial a_k} \log(b + o_j a_j) = \frac{o_j \delta_{jk} - 1}{b + o_j a_j},$$
where $\delta_{jk}$ is the Kronecker delta (the $-1$ arises because increasing $a_k$ decreases $b$). Therefore:
$$\frac{\partial W_y}{\partial a_k} = \frac{p_k o_k}{b + o_k a_k} - C, \qquad C = \sum_{j=1}^{m} \frac{p_j}{b + o_j a_j},$$
where the constant $C$ arises from differentiating $b$ inside every logarithm.
Let $p_k$ be the probability that outcome $k$ occurs and $o_k$ the odds. Define the active set $A = \{k : a_k > 0\}$ and its complement $A^c$. Write
$$P = \sum_{k \in A} p_k, \qquad \sigma = \sum_{k \in A} \frac{1}{o_k}.$$
The unique growth-rate-maximizing allocation is:
$$a_k = p_k - \frac{b}{o_k} \quad (k \in A), \qquad a_k = 0 \quad (k \in A^c), \qquad b = \frac{1 - P}{1 - \sigma}.$$
An outcome belongs to the active set if and only if its expected return exceeds the holdback: $k \in A \iff p_k o_k > b$.
Proof. The growth rate is strictly concave in the $a_k$ (it is a positively weighted sum of logarithms of affine functions), so the Karush-Kuhn-Tucker conditions are necessary and sufficient. They require $\partial W_y / \partial a_k = 0$ for $k \in A$ and $\partial W_y / \partial a_k \le 0$ for $k \in A^c$. Setting the derivative above to zero yields the active-set condition:
$$\frac{p_k o_k}{b + o_k a_k} = C \quad (k \in A),$$
where $C$ is independent of $k$.
Step 1 (Express allocations in terms of $C$). From the active-set condition:
$$b + o_k a_k = \frac{p_k o_k}{C},$$
giving $a_k = \dfrac{p_k}{C} - \dfrac{b}{o_k}$.
Step 2 (Determine $C$). Decompose the sum defining $C$ over the active and inactive sets:
$$C = \sum_{k \in A} \frac{p_k}{b + o_k a_k} + \sum_{k \in A^c} \frac{p_k}{b} = C \sum_{k \in A} \frac{1}{o_k} + \frac{1 - P}{b} = C \sigma + \frac{1 - P}{b},$$
so $C (1 - \sigma) = \dfrac{1 - P}{b}$, equivalently $C b = \dfrac{1 - P}{1 - \sigma}$.
Step 3 (Apply the budget constraint). From $\sum_{k \in A} a_k = 1 - b$:
$$\sum_{k \in A} \left( \frac{p_k}{C} - \frac{b}{o_k} \right) = \frac{P}{C} - b \sigma = 1 - b.$$
Substituting $b = \dfrac{1 - P}{C (1 - \sigma)}$:
$$\frac{P}{C} - \frac{\sigma (1 - P)}{C (1 - \sigma)} = 1 - \frac{1 - P}{C (1 - \sigma)};$$
multiplying through by $C$ and simplifying, $C = P + (1 - P) = 1$.
Step 4 (Final formulas). Substituting $C = 1$: $a_k = p_k - \dfrac{b}{o_k}$ and $b = \dfrac{1 - P}{1 - \sigma}$ for $k \in A$. The membership condition $a_k > 0$ is equivalent to $p_k > \dfrac{b}{o_k}$, i.e., $p_k o_k > b$. ∎
With side information, the decomposition lemma reduces the problem to one instance of the Kelly theorem per received symbol $y$, with the posterior $p(\cdot \mid y)$ in place of $p$. The optimal strategy is $b_k(y) = p(k \mid y) - \dfrac{b_0(y)}{o_k}$ for $k \in A_y$ and $b_k(y) = 0$ for $k \notin A_y$, where $b_0(y) = \dfrac{1 - P_y}{1 - \sigma_y}$ and $P_y, \sigma_y$ are the active-set sums computed from the posterior $p(\cdot \mid y)$.
The active set and the holdback are mutually dependent: membership in $A$ is tested against $b$, while $b$ depends on $A$ through $P$ and $\sigma$. In practice, sort outcomes by $p_k o_k$ in decreasing order and add them to $A$ one at a time, recomputing $b$ at each step, until the membership condition fails.
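This procedure translates directly into code. The sketch below (a hypothetical helper, with made-up probabilities and odds) implements the sort-and-admit loop for a single received symbol; pass the posterior $p(\cdot \mid y)$ as `p` to recover the side-information version.

```python
import numpy as np

def kelly_allocation(p, o):
    """Growth-optimal bet fractions for win probabilities p and o-for-1 odds o.

    Sorts outcomes by expected return p_k * o_k, admits them one at a time,
    and recomputes the holdback b = (1 - P) / (1 - sigma) after each admission,
    stopping when the membership test p_k * o_k > b fails.
    Returns (a, b): the bet fractions and the holdback.
    """
    p, o = np.asarray(p, dtype=float), np.asarray(o, dtype=float)
    order = np.argsort(-p * o)               # decreasing expected return
    active, P, sigma, b = [], 0.0, 0.0, 1.0  # empty active set: b = 1 (bet nothing)
    for k in order:
        if p[k] * o[k] <= b:                 # membership condition fails: stop
            break
        active.append(k)
        P += p[k]
        sigma += 1.0 / o[k]
        # With superfair odds the whole field can enter, driving b to zero.
        b = (1.0 - P) / (1.0 - sigma) if sigma < 1.0 else 0.0
    a = np.zeros_like(p)
    for k in active:
        a[k] = p[k] - b / o[k]
    return a, b

# Made-up example with a small track take: sum(1/o) slightly exceeds 1.
p = np.array([0.5, 0.3, 0.2])
o = np.array([2.2, 3.0, 4.5])
a, b = kelly_allocation(p, o)
print(a, b, a.sum() + b)                     # allocations, holdback, budget check = 1.0
```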
4. Growth Rate as Mutual Information
The odds are fair when $\sum_k 1/o_k = 1$, so that the reciprocal odds $r_k = 1/o_k$ form a probability distribution on outcomes. In particular, $o_k = 1/p(k)$ gives true fair odds reflecting the source distribution.
Under true fair odds $o_k = 1/p(k)$, the maximum growth rate with side information equals the mutual information:
$$W^* = I(X; Y).$$
Proof. With $o_k = 1/p(k)$, the full sum $\sum_k 1/o_k = \sum_k p(k) = 1$, so the odds are fair. For each received symbol $y$, consider the candidate strategy $b_k(y) = p(k \mid y)$ for all $k$, with holdback $b_0(y) = 0$.
Feasibility. The budget constraint is satisfied: $\sum_k b_k(y) = \sum_k p(k \mid y) = 1$. Each $b_k(y) = p(k \mid y) \ge 0$.
Optimality. Evaluate the partial derivative from the previous section at this candidate:
$$\frac{\partial W_y}{\partial b_k(y)} = \frac{p(k \mid y)\, o_k}{o_k\, p(k \mid y)} - \sum_j \frac{p(j \mid y)}{o_j\, p(j \mid y)} = 1 - \sum_j p(j) = 0.$$
Since $W_y$ is concave and all partial derivatives vanish simultaneously, the candidate is the global maximizer.
Growth rate. The optimal per-symbol growth rate is:
$$W_y^* = \sum_k p(k \mid y) \log\big(o_k\, p(k \mid y)\big) = \sum_k p(k \mid y) \log \frac{p(k \mid y)}{p(k)} = D\big(p(\cdot \mid y) \,\|\, p\big).$$
Averaging over received symbols:
$$W^* = \sum_y q(y)\, W_y^* = \sum_y q(y)\, D\big(p(\cdot \mid y) \,\|\, p\big) = I(X; Y),$$
where the last equality is the divergence decomposition of mutual information. ∎
Geometrically, the optimal Kelly strategy maps each received symbol $y$ to the corresponding posterior $p(\cdot \mid y)$ on the simplex $\Delta^{m-1}$, and the growth rate measures the expected KL divergence traversed from the prior $p$ to the posterior $p(\cdot \mid y)$. The gambler's capital grows precisely because the private signal moves the posterior away from the prior, and this displacement, averaged over the output distribution, is the mutual information.
Without side information (i.e., when $X$ and $Y$ are independent), the maximum growth rate under true fair odds is $W^* = 0$. No betting strategy can achieve positive long-run growth against a fair market without an informational edge.
Proof. When $X$ and $Y$ are independent, $p(k \mid y) = p(k)$ for all $y$, so $W^* = I(X; Y) = 0$ by the nonnegativity result. ∎
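The theorem is easy to check numerically: price the horses at true fair odds, bet each posterior with no holdback, and compare the achieved growth rate with $H(X) - H(X \mid Y)$. A sketch with an arbitrary three-horse source and channel:

```python
import numpy as np

def entropy(p):
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

prior = np.array([0.5, 0.3, 0.2])          # p(k), illustrative
odds = 1.0 / prior                          # true fair odds o_k = 1 / p(k)
Q = np.array([[0.8, 0.1, 0.1],              # Q[k, y] = q(y | k), illustrative
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])

joint = prior[:, None] * Q
q_y = joint.sum(axis=0)
post = joint / q_y                          # posterior p(k | y), one column per y

# Bet the posterior with no holdback; the multiplier on horse k is o_k * p(k|y).
W_star = sum(q_y[y] * np.sum(post[:, y] * np.log2(odds * post[:, y]))
             for y in range(3))
I = entropy(prior) - sum(q_y[y] * entropy(post[:, y]) for y in range(3))
print(W_star, I)                            # equal: W* = I(X;Y)
```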
5. Binary Prediction Markets
We specialize to two outcomes: YES ($k = 1$) and NO ($k = 2$). The gambler believes $\Pr[\text{YES}] = p$ while the market prices a YES contract at $c \in (0, 1)$, giving odds $o_1 = 1/c$ and $o_2 = 1/(1 - c)$. These are fair odds since $1/o_1 + 1/o_2 = c + (1 - c) = 1$.
For a binary prediction market with true probability $p$ and market price $c$, with $p > c$ (positive edge on YES), the Kelly-optimal fraction of capital to bet on YES is
$$f^* = \frac{p - c}{1 - c},$$
and the resulting growth rate is
$$G^* = p \log \frac{p}{c} + (1 - p) \log \frac{1 - p}{1 - c}.$$
Proof. Apply the Kelly theorem with active set $A = \{\text{YES}\}$, so $P = p$ and $\sigma = 1/o_1 = c$. (NO sits exactly at the membership boundary: $p_2 o_2 = \frac{1 - p}{1 - c} = b$, so it is correctly excluded.) Then
$$b = \frac{1 - P}{1 - \sigma} = \frac{1 - p}{1 - c}, \qquad f^* = p - \frac{b}{o_1} = p - c \cdot \frac{1 - p}{1 - c} = \frac{p - c}{1 - c}.$$
For the growth rate, the wealth multiplier when YES wins is $b + o_1 f^*$. Combining over a common denominator:
$$b + o_1 f^* = \frac{1 - p}{1 - c} + \frac{1}{c} \cdot \frac{p - c}{1 - c} = \frac{c (1 - p) + (p - c)}{c (1 - c)} = \frac{p}{c}.$$
When NO wins, the multiplier is $b = \frac{1 - p}{1 - c}$. Therefore:
$$G^* = p \log \frac{p}{c} + (1 - p) \log \frac{1 - p}{1 - c}.$$
This is the KL divergence from the market distribution $(c, 1 - c)$ to the gambler's true distribution $(p, 1 - p)$, written $D(p \,\|\, c)$ below. ∎
When $p < c$, the edge is on NO. By symmetry, the optimal fraction to bet on NO is $f^* = \frac{c - p}{c}$, and the growth rate is again $D(p \,\|\, c)$. In either direction, the Kelly growth rate for a binary prediction market is the KL divergence between the gambler's beliefs and the market's implied probabilities, measuring the information-theoretic "distance" between truth and market on the one-dimensional simplex $\Delta^1$.
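Both directions collapse into a few lines of code. A minimal sketch with illustrative beliefs and prices:

```python
import numpy as np

def binary_kelly(p, c):
    """Kelly fraction and growth rate (bits/bet) for belief p against price c."""
    if p > c:
        f = (p - c) / (1 - c)          # edge on YES
    else:
        f = (c - p) / c                # edge on NO (f = 0 when p = c)
    G = p * np.log2(p / c) + (1 - p) * np.log2((1 - p) / (1 - c))
    return f, G

print(binary_kelly(0.6, 0.5))          # f* = 0.2, G* = D(0.6 || 0.5) ≈ 0.029
print(binary_kelly(0.4, 0.5))          # same growth rate by symmetry, betting NO
```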
The visualization below shows the growth-rate curve for even-money odds ($c = 1/2$) as the true probability $p$ sweeps from 0.52 to 0.80. The gold dot marks the optimal fraction $f^* = 2p - 1$, and the faded traces show snapshots at intermediate values of $p$.
The next figure illustrates the binary Kelly growth rate as a KL divergence on the simplex $\Delta^1$. The top panel shows the market price $c$ (hollow) and the gambler's true probability $p$ (gold) on the simplex. The bottom panel plots $D(p \,\|\, c)$ as a function of $p$.
Finally, we simulate the long-run behavior. Four wealth trajectories are drawn for each of three strategies: full Kelly (gold), half Kelly (blue), and double Kelly (coral). Full Kelly maximizes asymptotic growth but with high variance. Half Kelly sacrifices some growth for smoother paths. Double Kelly overbets, producing frequent ruin despite the positive edge. Dashed lines show the theoretical expected growth rate of each strategy.
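A simulation along these lines takes only a few lines of NumPy. The sketch below uses illustrative parameters ($p = 0.6$, $c = 0.5$, a fixed seed) and prints each path's realized per-race growth; full Kelly clusters near $D(p \,\|\, c) \approx 0.029$ bits, half Kelly near 75% of that, and double Kelly near zero or below.

```python
import numpy as np

rng = np.random.default_rng(0)
p, c = 0.6, 0.5                                    # illustrative edge
f_full = (p - c) / (1 - c)                         # full-Kelly fraction = 0.2
n_races, n_paths = 10_000, 4

for label, f in [("full", f_full), ("half", 0.5 * f_full), ("double", 2.0 * f_full)]:
    wins = rng.random((n_paths, n_races)) < p
    # Log-wealth increments: log2(1 - f + f/c) on a win, log2(1 - f) on a loss.
    steps = np.where(wins, np.log2(1 - f + f / c), np.log2(1 - f))
    growth = steps.sum(axis=1) / n_races           # realized bits per race, per path
    print(label, growth)
```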
6. Application to Algorithmic Trading
The Kelly framework extends naturally from prediction markets to algorithmic trading systems, though several practical modifications are essential.
6.1. From discrete bets to continuous sizing
In practice, a trading system faces a stream of opportunities rather than a single repeated bet. Each opportunity $t$ has an estimated edge $\hat p_t - c_t$ over the market-implied probability $c_t$. The Kelly fraction for opportunity $t$ is $f_t^* = \frac{\hat p_t - c_t}{1 - c_t}$ (when the edge is on YES), and the capital allocated is $f_t^* V_t$, where $V_t$ is current wealth.
Two complications arise immediately. First, the true probability $p_t$ is never known exactly. The gambler works with an estimate $\hat p_t$ that carries both systematic bias and estimation variance. Second, multiple positions may be open simultaneously, introducing correlation structure that the single-bet Kelly formula ignores.
6.2. Fractional Kelly and estimation error
When the estimate $\hat p$ is noisy, full Kelly is too aggressive. To see why, consider the expected growth rate under misspecification. If the true probability is $p$ but the gambler bets as though it were $\hat p$, the realized growth rate is
$$G = p \log \frac{\hat p}{c} + (1 - p) \log \frac{1 - \hat p}{1 - c} = D(p \,\|\, c) - D(p \,\|\, \hat p),$$
which is strictly less than $D(p \,\|\, c)$ whenever $\hat p \ne p$. The loss from overestimating edge is asymmetrically worse than the loss from underestimating it, because the logarithm penalizes losses more than it rewards gains. This asymmetry motivates fractional Kelly: bet a fraction $\alpha \in (0, 1)$ of the full Kelly amount. Common choices are $\alpha = 1/2$ (half Kelly) or $\alpha = 1/4$ (quarter Kelly).
The fractional Kelly strategy trades expected growth for reduced variance. With fraction $\alpha$, the growth rate becomes approximately $\alpha (2 - \alpha)$ times the maximum, achieving $75\%$ of maximum growth at half Kelly while halving the volatility of log-returns [5].
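The $\alpha(2 - \alpha)$ factor is a quadratic approximation around the optimum, and checking it against the exact growth rate is immediate. A sketch with the same illustrative $p = 0.6$, $c = 0.5$ as before:

```python
import numpy as np

p, c = 0.6, 0.5
f_star = (p - c) / (1 - c)

def G(f):
    return p * np.log2(1 - f + f / c) + (1 - p) * np.log2(1 - f)

for alpha in [1.0, 0.5, 0.25]:
    exact = G(alpha * f_star) / G(f_star)    # realized fraction of maximum growth
    approx = alpha * (2 - alpha)             # quadratic approximation
    print(alpha, round(exact, 4), approx)    # half Kelly keeps about 75%
```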
6.3. Portfolio Kelly and correlation
When multiple positions are open simultaneously, the correct generalization replaces the scalar Kelly fraction with a vector problem. Given $n$ simultaneous opportunities with mean excess return vector $\mu$ and covariance matrix $\Sigma$, the growth-rate-maximizing portfolio weight vector is
$$w^* = \Sigma^{-1} \mu,$$
the familiar Markowitz mean-variance portfolio, which turns out to be the continuous-time limit of the Kelly criterion [6]. For a trading system running multiple concurrent positions on Polymarket or similar venues, this means accounting for correlation between outcomes. Correlated positions effectively concentrate risk, requiring smaller individual allocations than the independent-bet formula suggests.
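A minimal sketch of the vector rule. The return and covariance numbers are invented for illustration; a real system would estimate them from data and cap leverage so the weights respect the budget constraint.

```python
import numpy as np

# Hypothetical per-period mean excess returns and covariance for three positions.
mu = np.array([0.020, 0.015, 0.010])
Sigma = np.array([[0.010, 0.004, 0.002],
                  [0.004, 0.008, 0.003],
                  [0.002, 0.003, 0.006]])

w_full = np.linalg.solve(Sigma, mu)      # w* = Sigma^{-1} mu, full Kelly
w_half = 0.5 * w_full                    # fractional Kelly applies vectorially
print(w_full, w_full.sum())              # may imply leverage; rescale if constrained

# Ignoring correlations oversizes every position:
w_naive = mu / np.diag(Sigma)            # diagonal-only (independent-bet) weights
print(w_naive)                           # larger than w_full when correlations > 0
```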
6.4. Practical guidelines
For an algorithmic system implementing Kelly sizing:
- Estimate conservatively. Calibrate $\hat p$ on out-of-sample data using proper scoring rules (Brier score, log score). Prefer models that are well-calibrated over those that maximize accuracy.
- Use fractional Kelly. Start with a small fraction such as $\alpha = 1/4$ and increase it only as confidence in the probability estimates grows. The growth rate curve is flat near the optimum (see Figure 1), so undersizing costs far less than oversizing.
- Respect the budget constraint. The total allocation across all active bets must satisfy $\sum_k b_k(y) \le 1$. When many opportunities appear simultaneously, the constraint forces prioritization by edge quality $p_k o_k$, exactly as the active-set procedure of §3 describes.
- Rebalance on new information. Each new observation updates the posterior $p(\cdot \mid y)$, changing the Kelly fraction. The capital path is not set-and-forget but adapts continuously as the information channel evolves.
- Account for fees and slippage. Transaction costs reduce the effective odds $o_k$. If the market charges a fee $\phi$, replace $o_k$ with $(1 - \phi)\, o_k$ in all formulas; a sketch of the resulting sizing rule follows this list. Small edges can be entirely consumed by fees, leaving no profitable bet.
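As an illustration of the fee adjustment in the last item, the sketch below assumes one simple fee model: a charge of $\phi$ against the gross payout, which scales the YES odds from $1/c$ to $(1 - \phi)/c$ and therefore raises the effective price to $c/(1 - \phi)$. Other venues meter fees differently, so treat this as a template rather than a rule.

```python
def binary_kelly_with_fee(p, c, phi):
    """Kelly fraction on YES when a fraction phi of the gross payout is charged.

    Assumed fee model: effective odds (1 - phi) / c, effective price c / (1 - phi).
    """
    c_eff = c / (1.0 - phi)             # effective YES price after fees
    if p <= c_eff:
        return 0.0                      # the fee has consumed the edge entirely
    return (p - c_eff) / (1.0 - c_eff)

print(binary_kelly_with_fee(0.55, 0.50, 0.00))   # 0.10: the frictionless fraction
print(binary_kelly_with_fee(0.55, 0.50, 0.05))   # ≈ 0.05: half the edge is gone
print(binary_kelly_with_fee(0.55, 0.50, 0.10))   # 0.0: no profitable bet remains
```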
7. Conclusion
Kelly's 1956 result reveals a deep structural identity: the maximum rate at which a gambler can grow capital through repeated wagers is exactly the mutual information between the source of uncertainty and the signal available to the bettor. This connection is not a loose analogy. It is a theorem, derived here from the KKT conditions of a constrained concave optimization and the divergence decomposition of mutual information.
The proof passes through three stages. First, the growth rate decomposes into independent subproblems, one per received symbol. Second, solving each subproblem via the active-set conditions yields the closed-form allocation $b_k(y) = p(k \mid y) - b_0(y)/o_k$ on the active set. Third, under fair odds, the optimal growth rate for each received symbol reduces to a KL divergence from the prior to the posterior, and averaging recovers the mutual information.
For binary prediction markets, the result simplifies further: the Kelly fraction is $f^* = \frac{p - c}{1 - c}$ and the growth rate is the KL divergence $D(p \,\|\, c)$. No betting system can beat this rate, and no positive rate is achievable without an informational edge. These are not heuristics but consequences of the concavity of the logarithm and the geometry of the probability simplex.
In practice, uncertainty in the probability estimates, correlated positions, and transaction costs all require modifications. Fractional Kelly provides a principled way to trade growth for robustness, and the portfolio generalization handles simultaneous bets. The core message endures: information is capital, and the rate of capital growth is bounded by the rate of information.
Appendix: Kelly's Original Slash Notation
Kelly's paper writes conditional probabilities with a slash rather than a conditioning bar: $p(x/y)$ in place of $p(x \mid y)$, and $a(x/y)$ in place of the bet fraction $b_x(y)$. The full notation dictionary is:

| This post | Kelly (1956) | Meaning |
|---|---|---|
| $p(x)$ | $p(x)$ | Source probability |
| $q(y \mid x)$ | $p(y/x)$ | Channel transition |
| $q(y)$ | $q(y)$ | Output marginal |
| $p(x \mid y)$ | $p(x/y)$ | Posterior |
| $p(x, y)$ | $p(x, y)$ | Joint probability |
| $b_x(y)$ | $a(x/y)$ | Bet fraction |
| $H(X \mid Y)$ | $H(x/y)$ | Conditional entropy |
Growth rate (Kelly's Eq. 6). In slash notation:
$$G = \sum_{x, y} p(x, y) \log\big(a(0/y) + o_x\, a(x/y)\big),$$
where $a(0/y)$ denotes the holdback after receiving $y$.
Budget constraint. For each received symbol $y$:
$$a(0/y) + \sum_{x} a(x/y) = 1, \qquad a(x/y) \ge 0.$$
Optimal allocation (Kelly's Eq. 10). Writing $p(x/y)$ for the posterior, the KKT conditions yield:
$$a(x/y) = p(x/y) - \frac{a(0/y)}{o_x} \quad (x \in A_y), \qquad a(x/y) = 0 \quad (x \notin A_y), \qquad a(0/y) = \frac{1 - P_y}{1 - \sigma_y},$$
with $P_y = \sum_{x \in A_y} p(x/y)$ and $\sigma_y = \sum_{x \in A_y} 1/o_x$.
Proof sketch in slash notation. The partial derivatives of $G$ (Kelly's Eq. 7) are:
$$\frac{\partial G}{\partial a(x/y)} = q(y) \left[ \frac{p(x/y)\, o_x}{a(0/y) + o_x\, a(x/y)} - C_y \right], \qquad C_y = \sum_{x'} \frac{p(x'/y)}{a(0/y) + o_{x'}\, a(x'/y)}.$$
Setting the active-set equations to zero:
$$a(0/y) + o_x\, a(x/y) = \frac{p(x/y)\, o_x}{C_y} \quad (x \in A_y).$$
From the active-set condition, $a(x/y) = \frac{p(x/y)}{C_y} - \frac{a(0/y)}{o_x}$, giving the allocations in terms of $C_y$. Summing the definition of $C_y$ over $A_y$ and $A_y^c$ separately yields $C_y (1 - \sigma_y) = \frac{1 - P_y}{a(0/y)}$. The budget constraint then gives $C_y = 1$, so $a(0/y) = \frac{1 - P_y}{1 - \sigma_y}$.
Kelly's fundamental result (Kelly's Eq. 11). Under fair odds $o_x = 1/p(x)$, the optimal strategy is $a(x/y) = p(x/y)$, $a(0/y) = 0$, and the growth rate is:
$$G_{\max} = \sum_{x, y} p(x, y) \log \frac{p(x/y)}{p(x)} = H(X) - H(x/y).$$
The mutual information decomposes as $I(X; Y) = H(X) - H(x/y)$, where $H(x/y)$ is the conditional entropy in slash notation.
References
- Kelly, J. L. Jr. (1956). A New Interpretation of Information Rate. Bell System Technical Journal, 35(4), 917–926. DOI
- Shannon, C. E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal, 27(3), 379–423. DOI
- Cover, T. M. and Thomas, J. A. (2006). Elements of Information Theory, 2nd ed. Wiley.
- Amari, S. and Nagaoka, H. (2000). Methods of Information Geometry. AMS.
- MacLean, L. C., Thorp, E. O., and Ziemba, W. T. (2011). The Kelly Capital Growth Investment Criterion. World Scientific. DOI
- Thorp, E. O. (2006). The Kelly Criterion in Blackjack, Sports Betting, and the Stock Market. In Handbook of Asset and Liability Management, Vol. 1. DOI