Markov Chain Self-Guided Course
This document is a self-guided course on Markov chains. It is organized into four parts: conceptual foundations, first Rust implementations, text generation, and deeper theory. Each section is either a reading lesson or a hands-on Rust programming exercise. Sections marked 🚧 are stubs whose full content is tracked in an nbd ticket; follow the ticket ID to find the detailed learning objectives and instructions.
Table of Contents
Part 1 - Foundations
Part 2 - First Implementation
Part 3 - Text Generation
- Text Generation with Markov Chains
- Exercise 3 - Bigram Text Generator
- Exercise 4 - N-gram Generalization
Part 4 - Deeper Concepts
Part 1 - Foundations
1. What Is a Markov Chain?
A Markov chain is a mathematical model describing a sequence of events where the probability of each event depends only on the state reached in the previous event, not on the full history. This "memoryless" property is called the Markov property. You will learn where Markov chains appear in the real world and develop intuition for why the memoryless property is both a useful simplification and a meaningful assumption.
The core idea: only now matters. Imagine you are tracking today's weather. Intuitively, you might think yesterday's weather, the week before, and the entire season all influence what tomorrow will bring. A Markov chain says: forget all of that. Given that you know today's weather, knowledge of every earlier day adds nothing to your prediction of tomorrow. The present state captures everything relevant from the past. This is the Markov property, colloquially "memorylessness", and it is a surprisingly powerful modelling assumption.
Formally, let X₀, X₁, X₂, … be a sequence of random variables each taking values in some set of states. The sequence is a Markov chain if, for every time step n and every state s:
P(Xₙ₊₁ = s | X₀, X₁, …, Xₙ) = P(Xₙ₊₁ = s | Xₙ)
The left-hand side conditions on the entire history up to step n. The right-hand side conditions on only the current state Xₙ. The equation says these two quantities are always equal: no matter how you got to the current state, your distribution over the next state is the same.
A worked example: the weather model. Suppose a city has two kinds of days: Sunny and Rainy. You observe that:
- After a Sunny day, there is an 80% chance of another Sunny day and a 20% chance of Rain.
- After a Rainy day, there is a 40% chance of Sun and a 60% chance of Rain.
Under the Markov assumption, these two rules are all you need. If today is Sunny, the chance of rain tomorrow is 20%, regardless of whether the preceding week was a drought or a monsoon. The model is simple because it deliberately ignores deep history, and it is useful precisely because that history often adds little predictive power once you know the current state.
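The memorylessness claim can also be checked empirically. The sketch below is an illustration, not part of the exercises: the tiny `XorShift` generator is our own stand-in so no crates are needed. It simulates a long run of the two-state weather chain and estimates the chance of rain after a Sunny day, split by what yesterday's weather was; both estimates come out near 0.2.

```rust
// Minimal xorshift64 PRNG so this sketch needs no external crates.
struct XorShift(u64);
impl XorShift {
    fn next_f64(&mut self) -> f64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        (x >> 11) as f64 / (1u64 << 53) as f64 // uniform in [0, 1)
    }
}

// 0 = Sunny, 1 = Rainy; transition rules from the worked example above.
fn step(state: usize, rng: &mut XorShift) -> usize {
    let p_rain = if state == 0 { 0.2 } else { 0.6 };
    if rng.next_f64() < p_rain { 1 } else { 0 }
}

/// Estimate P(rain tomorrow | today sunny, yesterday = `yesterday`).
fn rain_given(yesterday: usize, rng: &mut XorShift) -> f64 {
    let (mut hits, mut total) = (0u32, 0u32);
    let (mut prev, mut cur) = (0, 0);
    for _ in 0..200_000 {
        let next = step(cur, rng);
        // Condition on: yesterday's weather matched, today was sunny.
        if prev == yesterday && cur == 0 {
            total += 1;
            if next == 1 { hits += 1; }
        }
        prev = cur;
        cur = next;
    }
    hits as f64 / total as f64
}

fn main() {
    let mut rng = XorShift(42);
    // Both estimates hover near 0.2: yesterday adds no information.
    println!("P(rain | sunny today, sunny yesterday) ≈ {:.3}", rain_given(0, &mut rng));
    println!("P(rain | sunny today, rainy yesterday) ≈ {:.3}", rain_given(1, &mut rng));
}
```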
Where Markov chains appear in the real world. Once you recognise the Markov property you will spot it everywhere:
- Board games. In Snakes and Ladders the only thing that matters is which square you are on right now. The sequence of rolls that brought you there is irrelevant: your future depends only on your current position.
- Web surfing (PageRank). Google's original PageRank algorithm modelled a hypothetical random web surfer who, at each page, clicks a link chosen uniformly at random. Where the surfer goes next depends only on the current page, not the path taken to reach it. The long-run fraction of time spent at each page is the page's rank.
- Genetics. In simple population-genetics models, the number of copies of a gene in the current generation determines the distribution of copies in the next generation. The frequencies in prior generations, once summarised in the current count, carry no additional information.
- Text generation. Given the last word (or last few words) of a sentence, the probability of the next word can be estimated from a corpus. The full sentence history is ignored; only the recent context matters. Sections 6-8 of this course build exactly this kind of model in Rust.
Why the Markov property is a useful assumption. Real systems are almost never perfectly memoryless: yesterday's weather genuinely does carry a whisper of information beyond today's. So why use Markov models? Because they strike a remarkable balance between tractability and expressiveness. A model that conditions on the entire history is usually intractable; one that ignores history entirely is too crude. The Markov property is the sweet spot: it allows rigorous mathematical analysis (stationary distributions, convergence theorems, efficient simulation) while still capturing the essential dynamics of many real processes. When the assumption is too crude, you can extend it by enlarging the state space (for instance, tracking the last two days of weather instead of one) and the Markov property holds again at that richer level of description.
2. States and Transitions
Every Markov chain consists of a set of states and a rule for how the system moves between them. The set of all possible states is called the state space, commonly written S. In this course the state space is always discrete: it is either finite (e.g., {Sunny, Rainy}) or countably infinite (e.g., the non-negative integers {0, 1, 2, …}, as in a random walk on the number line). Discrete state spaces are by far the most common setting for introductory Markov chain theory and for the kinds of simulations you will build in Rust.
A transition is a single step from one state to another. At each discrete time step the chain occupies some state i ∈ S, then moves to state j ∈ S with probability P(i → j). Two things matter: (1) transitions are directed, so going from i to j and going from j to i are distinct transitions with potentially different probabilities; and (2) every transition has an associated probability, and the probabilities of all transitions out of a given state must sum to 1. A transition from i back to i (a self-loop) is perfectly valid and simply means the chain stays put with some nonzero probability.
State-transition diagrams make these rules visual. Draw one node for each state and one labelled arrow for each possible transition; the label is the transition probability. For the two-state weather model from Section 1 the diagram looks like this:
      0.8                                0.6
    +------+                           +------+
    |      |            0.2            |      |
    v      |    ------------------>    |      v
  +-------+                          +-------+
  | SUNNY |                          | RAINY |
  +-------+                          +-------+
       ^                                 |
       |              0.4                |
       +---------------------------------+
The self-loops (Sunny → Sunny with 0.8, Rainy → Rainy with 0.6) show that the chain can stay in the same state. The cross-arrows capture the transitions between states. Every row of outgoing arrows sums to 1: from Sunny, 0.8 + 0.2 = 1; from Rainy, 0.4 + 0.6 = 1.
States in a Markov chain differ in how they relate to the chainโs long-run behaviour. Three categories matter most:
- An absorbing state is one you can never leave: once the chain enters it, it stays there forever. A self-loop with probability 1 is the defining feature. A simple example is a gambler who reaches £0 in a gambling game with no credit: they are stuck.
- A transient state is one you are not guaranteed to return to. If you leave a transient state there is some positive probability of never coming back. In many models early states are transient: the chain passes through them and moves on.
- A recurrent state (also called a persistent state) is one you are guaranteed to return to eventually, with probability 1. In the weather model both Sunny and Rainy are recurrent: no matter which state you are in, you will eventually visit both states again and again.
The distinction between transient and recurrent states determines the chainโs long-run behaviour. A chain that only visits transient states will eventually leave them forever; a chain trapped in recurrent states will cycle among them indefinitely. Understanding this classification is the foundation for studying stationary distributions, which Section 9 covers in detail.
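The absorbing case is the easiest to detect mechanically: by definition, a state is absorbing exactly when its self-loop probability is 1, so scanning the diagonal of a transition matrix finds them all. The sketch below is an illustration; the gambler's-ruin matrix is a hypothetical example, not part of the exercises.

```rust
// A state is absorbing exactly when its self-loop has probability 1.
// Sketch: scan each diagonal entry of a row-major transition matrix.
fn absorbing_states(p: &[Vec<f64>]) -> Vec<usize> {
    (0..p.len())
        .filter(|&i| (p[i][i] - 1.0).abs() < 1e-12)
        .collect()
}

fn main() {
    // A tiny gambler's-ruin chain on £0..£3: states 0 and 3 trap the chain.
    let p = vec![
        vec![1.0, 0.0, 0.0, 0.0], // £0: broke, stuck forever
        vec![0.5, 0.0, 0.5, 0.0],
        vec![0.0, 0.5, 0.0, 0.5],
        vec![0.0, 0.0, 0.0, 1.0], // £3: target reached, stop playing
    ];
    println!("absorbing: {:?}", absorbing_states(&p)); // [0, 3]
}
```

The weather chain from Section 1 has no absorbing state, which is consistent with both its states being recurrent.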
3. Transition Probabilities and Matrices
The rules governing how a Markov chain moves are captured in a transition matrix P, where P[i][j] is the probability of moving from state i to state j in one step. This section covers how to construct P, the constraints it must satisfy (rows sum to 1), and how to use matrix multiplication to compute multi-step probabilities.
Defining the transition matrix. Label the states 0, 1, …, n-1. The transition matrix P is an n × n array where entry P[i][j] gives the probability of moving to state j on the very next step, given that you are currently in state i. Because these are probabilities of mutually exclusive, exhaustive outcomes (from state i you must go somewhere), every row must sum to exactly 1 and every entry must lie between 0 and 1 inclusive. A matrix satisfying these two constraints is called a stochastic matrix (or row-stochastic matrix). Each row is itself a probability distribution over the next state.
The stochastic-matrix constraints, stated precisely. For an n-state chain:
P[i][j] >= 0 for all i, j
sum_j P[i][j] = 1 for every row i
A zero entry means the transition is impossible; a one means it is certain. Columns have no such constraint: column sums need not equal 1.
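These constraints are easy to check mechanically. A minimal validation sketch (the function name is illustrative):

```rust
// Sketch: validate the two stochastic-matrix constraints for a
// row-major matrix: entries in [0, 1], every row summing to 1.
fn is_stochastic(p: &[Vec<f64>]) -> bool {
    p.iter().all(|row| {
        let nonneg = row.iter().all(|&x| (0.0..=1.0).contains(&x));
        let sum: f64 = row.iter().sum();
        nonneg && (sum - 1.0).abs() < 1e-9 // tolerate float rounding
    })
}

fn main() {
    let weather = vec![vec![0.8, 0.2], vec![0.4, 0.6]];
    assert!(is_stochastic(&weather));
    // Note: the weather matrix's columns sum to 1.2 and 0.8 -
    // columns are unconstrained, only rows must sum to 1.
    let bad = vec![vec![0.9, 0.2], vec![0.4, 0.6]]; // row 0 sums to 1.1
    assert!(!is_stochastic(&bad));
    println!("weather matrix is stochastic");
}
```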
Multi-step probabilities via matrix multiplication. Suppose you start in state i at time 0. After one step the probability of being in state j is P[i][j]. After two steps you pass through some intermediate state k, so:
P^2[i][j] = sum_k P[i][k] * P[k][j]
This is exactly the (i, j) entry of P × P = P². In general, the probability of going from state i to state j in exactly k steps is the (i, j) entry of P^k. If you encode your current uncertainty as a row vector π₀ (a probability distribution over all states), then after k steps your updated distribution is:
π_k = π₀ · P^k
Each right-multiplication by P advances the clock one tick and blends probabilities according to the transition rules.
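One tick of this update can be sketched as a short helper (illustrative, not part of the exercises):

```rust
// Sketch: advance a distribution one tick by right-multiplying with P.
// Entry j of the result sums pi[k] * P[k][j] over all intermediate states k.
fn advance(pi: &[f64], p: &[Vec<f64>]) -> Vec<f64> {
    let n = p.len();
    (0..n).map(|j| (0..n).map(|k| pi[k] * p[k][j]).sum()).collect()
}

fn main() {
    let p = vec![vec![0.8, 0.2], vec![0.4, 0.6]];
    let pi0 = vec![1.0, 0.0];      // certainly Sunny today
    let pi1 = advance(&pi0, &p);   // ≈ [0.8, 0.2]
    let pi2 = advance(&pi1, &p);   // ≈ [0.72, 0.28]
    println!("after 1 step: {:?}", pi1);
    println!("after 2 steps: {:?}", pi2);
}
```

Calling `advance` k times computes π₀ · P^k without ever forming P^k explicitly.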
Worked example: a two-state weather chain. Consider a model with two states: Sunny (state 0) and Rainy (state 1). From data:
- If today is Sunny, tomorrow is Sunny with probability 0.8 and Rainy with probability 0.2.
- If today is Rainy, tomorrow is Sunny with probability 0.4 and Rainy with probability 0.6.
Writing this as a matrix:
Sunny Rainy
Sunny [ 0.8 0.2 ]
Rainy [ 0.4 0.6 ]
Row 0 sums to 1.0; row 1 sums to 1.0. All entries are non-negative. P is a valid stochastic matrix.
One step. Start with certainty in Sunny: π₀ = [1, 0].
π₁ = π₀ · P = [1, 0] · [[0.8, 0.2], [0.4, 0.6]] = [0.8, 0.2]
Tomorrow: 80% Sunny, 20% Rainy.
Two steps. Apply P again:
π₂ = π₁ · P = [0.8, 0.2] · [[0.8, 0.2], [0.4, 0.6]]
= [0.8*0.8 + 0.2*0.4, 0.8*0.2 + 0.2*0.6]
= [0.72, 0.28]
Equivalently, compute Pยฒ once and read off the row for state 0:
P^2 = [[0.8*0.8 + 0.2*0.4, 0.8*0.2 + 0.2*0.6],
[0.4*0.8 + 0.6*0.4, 0.4*0.2 + 0.6*0.6]]
= [[0.72, 0.28],
[0.56, 0.44]]
P²[0] = [0.72, 0.28], matching the step-by-step result. Starting from Rainy gives P²[1] = [0.56, 0.44]; the two rows are already noticeably closer to each other than the original [0.8, 0.2] vs [0.4, 0.6]. As k grows, both rows converge toward the same limiting vector: the stationary distribution that Section 9 analyses in depth. The matrix-multiplication perspective makes this convergence precise and computable.
Part 2 - First Implementation
4. Exercise 1 - Weather Model
Goal: Build a two-state Markov chain in Rust that models daily weather as either Sunny or Rainy, driven by a transition matrix, and simulate 30 days of weather.
Setup
Create a new Cargo binary project and add the rand crate. This course targets the rand 0.8 API; the small_rng feature enables the seedable SmallRng generator used below:
cargo new weather-chain
cd weather-chain
cargo add rand@0.8 --features small_rng
Starter Code
Replace the contents of src/main.rs with the following skeleton. Do not change the struct layout or function signatures; your task is to fill in the todo!() bodies and write main.
use rand::Rng;
#[derive(Debug, Clone, Copy, PartialEq)]
enum Weather { Sunny, Rainy }
struct WeatherChain {
/// transition[current][next] = probability
transition: [[f64; 2]; 2],
}
impl WeatherChain {
fn step(&self, current: Weather, rng: &mut impl Rng) -> Weather { todo!() }
fn simulate(&self, start: Weather, steps: usize, rng: &mut impl Rng) -> Vec<Weather> { todo!() }
}
fn main() { todo!() }
Step 1 โ Index conversion
step needs to index into self.transition using the current state, and later convert an index back to a Weather value. Add two helper methods to the Weather enum:
- fn index(self) -> usize returns 0 for Sunny, 1 for Rainy
- fn from_index(i: usize) -> Self returns Sunny for 0, Rainy for anything else
A simple match expression handles both. These are the only tools you need to bridge the enum and the matrix.
Step 2 โ Implement WeatherChain::step
step must sample the next state from the probability row for the current state. The technique is a cumulative-probability walk:
- Retrieve the transition row: let row = self.transition[current.index()];
- Draw a uniform float in [0, 1): let r: f64 = rng.gen();
- Walk the row, accumulating probability. When the running total first exceeds r, return that state.
For the two-state case this reduces to a single comparison (if r < row[0] return Sunny, otherwise return Rainy), but implementing the loop works for any number of states and is worth practising.
Add a fallback Weather::from_index(row.len() - 1) after the loop to satisfy the compiler; floating-point rounding can in rare cases leave the loop without returning.
Step 3 โ Implement WeatherChain::simulate
simulate runs the chain for steps transitions, collecting every state visited including the start:
- Allocate a Vec with capacity steps + 1.
- Push start and set current = start.
- Loop steps times: call self.step(current, rng), update current, and push.
- Return the Vec.
The returned Vec will have length steps + 1 (the initial state plus one state per step).
Step 4 โ Run the simulation
In main, create a WeatherChain with the two-state matrix from Section 3:
Sunny [0.8, 0.2]
Rainy [0.4, 0.6]
Seed a repeatable RNG with rand::rngs::SmallRng::seed_from_u64(42) (add use rand::SeedableRng;). Simulate 30 steps starting from Sunny, print the resulting sequence, and count how many days were sunny vs rainy.
Expected output structure (exact numbers vary by seed):
[Sunny, Sunny, Rainy, Sunny, ...]
Sunny days: 21 (67.7%)
Rainy days: 10 (32.3%)
Step 5 โ Compare to the stationary distribution
The stationary distribution π satisfies π = πP, meaning once the chain reaches it, the distribution no longer changes. For this matrix, solve the system:
π₀ = 0.8·π₀ + 0.4·π₁
π₁ = 0.2·π₀ + 0.6·π₁
π₀ + π₁ = 1
The first equation simplifies to 0.2·π₀ = 0.4·π₁, giving π₀ = 2·π₁. Substituting into the normalisation constraint: π₀ = 2/3 ≈ 66.7%, π₁ = 1/3 ≈ 33.3%.
Print the stationary percentages alongside your simulated counts. With only 31 data points the match will be rough; re-run with 1,000 or 10,000 steps to see the empirical frequencies converge.
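For any two-state chain this system has a closed form: writing a = P(Sunny → Rainy) and b = P(Rainy → Sunny), the balance equation π₀·a = π₁·b gives π = [b/(a+b), a/(a+b)]. A sketch that computes it and checks the fixed point numerically (the helper name is illustrative):

```rust
// Sketch: closed-form stationary distribution of a general two-state chain.
// With a = P(Sunny -> Rainy) and b = P(Rainy -> Sunny), balance gives
// pi_sunny * a = pi_rainy * b, hence pi = [b/(a+b), a/(a+b)].
fn two_state_stationary(a: f64, b: f64) -> [f64; 2] {
    [b / (a + b), a / (a + b)]
}

fn main() {
    let pi = two_state_stationary(0.2, 0.4);
    println!("pi = [{:.4}, {:.4}]", pi[0], pi[1]); // [0.6667, 0.3333]
    // Fixed-point check: pi * P == pi for the weather matrix.
    let next = [pi[0] * 0.8 + pi[1] * 0.4, pi[0] * 0.2 + pi[1] * 0.6];
    assert!((next[0] - pi[0]).abs() < 1e-9);
    assert!((next[1] - pi[1]).abs() < 1e-9);
    println!("pi is a fixed point of P");
}
```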
Reference Solution
Show full solution
use rand::{Rng, SeedableRng, rngs::SmallRng};
#[derive(Debug, Clone, Copy, PartialEq)]
enum Weather { Sunny, Rainy }
impl Weather {
fn index(self) -> usize {
match self {
Weather::Sunny => 0,
Weather::Rainy => 1,
}
}
fn from_index(i: usize) -> Self {
match i {
0 => Weather::Sunny,
_ => Weather::Rainy,
}
}
}
struct WeatherChain {
/// transition[current][next] = probability
transition: [[f64; 2]; 2],
}
impl WeatherChain {
fn step(&self, current: Weather, rng: &mut impl Rng) -> Weather {
let row = self.transition[current.index()];
let r: f64 = rng.gen();
let mut cumulative = 0.0;
for (i, &prob) in row.iter().enumerate() {
cumulative += prob;
if r < cumulative {
return Weather::from_index(i);
}
}
Weather::from_index(row.len() - 1)
}
fn simulate(&self, start: Weather, steps: usize, rng: &mut impl Rng) -> Vec<Weather> {
let mut states = Vec::with_capacity(steps + 1);
states.push(start);
let mut current = start;
for _ in 0..steps {
current = self.step(current, rng);
states.push(current);
}
states
}
}
fn main() {
let mut rng = SmallRng::seed_from_u64(42);
let chain = WeatherChain {
transition: [[0.8, 0.2], [0.4, 0.6]],
};
let states = chain.simulate(Weather::Sunny, 30, &mut rng);
println!("{:?}", states);
let total = states.len() as f64;
let sunny = states.iter().filter(|&&s| s == Weather::Sunny).count();
let rainy = states.len() - sunny;
println!("Sunny days: {} ({:.1}%)", sunny, 100.0 * sunny as f64 / total);
println!("Rainy days: {} ({:.1}%)", rainy, 100.0 * rainy as f64 / total);
println!("Stationary: Sunny ≈ 66.7%, Rainy ≈ 33.3%");
}
5. Exercise 2 - Simulating a Random Walk
Goal: Implement a one-dimensional random walk on the integers (states -N … +N) with reflecting boundaries, then measure the empirical distribution of positions after T steps.
A random walk is one of the simplest Markov chains: the state is a single integer position, and at each step the walker moves left or right according to a fixed probability. Adding reflecting boundaries means the walker cannot escape the interval: a step that would leave the interval is clamped to the nearest boundary.
Learning objectives
- Model a finite integer state space in Rust
- Implement reflecting (clamped) boundary conditions
- Aggregate many simulation runs into a histogram
- Visualise the distribution as an ASCII bar chart
- Observe how asymmetric step probabilities skew the stationary distribution
Setup
You can extend the weather-chain project from Exercise 1, or create a fresh binary:
cargo new random-walk
cd random-walk
cargo add rand@0.8 --features small_rng
You will also need use std::collections::HashMap; at the top of src/main.rs.
Starter Code
Replace the contents of src/main.rs with the following skeleton. Do not change the struct layout or function signatures; your task is to fill in the todo!() bodies.
use rand::Rng;
use std::collections::HashMap;
struct RandomWalk {
min: i32,
max: i32,
/// prob_right[i] = probability of stepping right from position `min + i as i32`
prob_right: Vec<f64>,
}
impl RandomWalk {
fn step(&self, pos: i32, rng: &mut impl Rng) -> i32 { todo!() }
fn simulate(&self, start: i32, steps: usize, rng: &mut impl Rng) -> Vec<i32> { todo!() }
fn histogram(&self, start: i32, steps: usize, trials: usize, rng: &mut impl Rng)
-> HashMap<i32, usize> { todo!() }
}
fn print_histogram(hist: &HashMap<i32, usize>, min: i32, max: i32, trials: usize) { todo!() }
fn main() { todo!() }
Step 1 โ Constructor and index conversion
The prob_right vec is indexed from 0, but positions range from min to max. Add two helpers to RandomWalk:
- fn new(min: i32, max: i32, p: f64) -> Self creates a uniform walk where every position has the same step probability p. Fill prob_right with vec![p; (max - min + 1) as usize].
- fn idx(&self, pos: i32) -> usize converts a position to a vec index: (pos - self.min) as usize. Use this everywhere you index into prob_right.
Step 2 โ Implement RandomWalk::step
step must sample the next position:
- Draw a uniform float r in [0, 1): let r: f64 = rng.gen();
- Look up the step probability: let p = self.prob_right[self.idx(pos)];
- Choose direction: if r < p, move right (pos + 1); otherwise move left (pos - 1).
- Apply the reflecting boundary by clamping: moved.clamp(self.min, self.max)
let moved = if r < p { pos + 1 } else { pos - 1 };
moved.clamp(self.min, self.max)
Clamping is the simplest reflecting rule: a step that would leave the interval is silently held at the boundary. An alternative is strict reflection (a step right from max lands at max - 1), but clamping is sufficient here.
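The two boundary rules differ only at the walls. A minimal sketch contrasting them (helper names are illustrative):

```rust
// Sketch: two boundary rules applied to a raw proposed position.
fn clamp_rule(proposed: i32, min: i32, max: i32) -> i32 {
    proposed.clamp(min, max) // hold at the wall
}

fn reflect_rule(proposed: i32, min: i32, max: i32) -> i32 {
    if proposed > max {
        2 * max - proposed // bounce back inward
    } else if proposed < min {
        2 * min - proposed
    } else {
        proposed
    }
}

fn main() {
    // A right step proposed from max = 5 would land at 6:
    assert_eq!(clamp_rule(6, -5, 5), 5);   // clamping stays on the wall
    assert_eq!(reflect_rule(6, -5, 5), 4); // strict reflection bounces to 4
    // In the interior the two rules agree:
    assert_eq!(clamp_rule(3, -5, 5), reflect_rule(3, -5, 5));
    println!("boundary rules agree in the interior, differ at the walls");
}
```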
Step 3 โ Implement RandomWalk::simulate and RandomWalk::histogram
simulate runs the chain for steps transitions and collects every position visited:
- Allocate a Vec with capacity steps + 1.
- Push start and set current = start.
- Loop steps times: call self.step(current, rng), update current, push.
- Return the vec.
histogram runs many independent trials and records where each trial ends:
- Allocate an empty HashMap<i32, usize>.
- Loop trials times: call self.simulate(start, steps, rng), extract *positions.last().unwrap(), and increment its count with *hist.entry(pos).or_insert(0) += 1.
- Return the map.
The histogram answers: after steps steps starting from start, what fraction of the time does the walker end at each position?
Step 4 โ Print an ASCII bar chart
Write print_histogram to display the distribution. For each position from min to max:
- Look up count = hist.get(&pos).copied().unwrap_or(0).
- Scale to a bar: bar_len = count * bar_width / max_count (use bar_width = 40 and max_count = hist.values().copied().max().unwrap_or(1)).
- Print the position, a bar of # characters, and the percentage.
let bar: String = "#".repeat(bar_len);
println!("{:4}: {:40} ({:.1}%)", pos, bar, 100.0 * count as f64 / trials as f64);
Step 5 โ Compare symmetric vs asymmetric walks
In main, run two experiments. Seed a repeatable RNG with SmallRng::seed_from_u64(42) (add use rand::{SeedableRng, rngs::SmallRng};).
Symmetric walk (p = 0.5, range -5 … 5):
let walk = RandomWalk::new(-5, 5, 0.5);
let hist = walk.histogram(0, 200, 10_000, &mut rng);
print_histogram(&hist, -5, 5, 10_000);
Asymmetric walk (p = 0.7, same range):
let walk = RandomWalk::new(-5, 5, 0.7);
let hist = walk.histogram(0, 200, 10_000, &mut rng);
print_histogram(&hist, -5, 5, 10_000);
Observe:
- Symmetric (p = 0.5): after enough steps the distribution is approximately uniform: each of the 11 positions appears with roughly 9% frequency. This is the stationary distribution for a symmetric walk with clamping boundaries.
- Asymmetric (p = 0.7): the walker drifts toward the right boundary; positions near +5 accumulate much higher frequencies than those near -5. The stationary distribution is geometric, rising steeply toward max.
Try lowering steps from 200 to 10 and observe that the distribution has not yet converged: it is still concentrated near start. This illustrates the mixing time of the chain.
Reference Solution
Show full solution
use rand::{Rng, SeedableRng, rngs::SmallRng};
use std::collections::HashMap;
struct RandomWalk {
min: i32,
max: i32,
/// prob_right[i] = probability of stepping right from position `min + i as i32`
prob_right: Vec<f64>,
}
impl RandomWalk {
fn new(min: i32, max: i32, p: f64) -> Self {
let n = (max - min + 1) as usize;
Self { min, max, prob_right: vec![p; n] }
}
fn idx(&self, pos: i32) -> usize {
(pos - self.min) as usize
}
fn step(&self, pos: i32, rng: &mut impl Rng) -> i32 {
let r: f64 = rng.gen();
let p = self.prob_right[self.idx(pos)];
let moved = if r < p { pos + 1 } else { pos - 1 };
moved.clamp(self.min, self.max)
}
fn simulate(&self, start: i32, steps: usize, rng: &mut impl Rng) -> Vec<i32> {
let mut positions = Vec::with_capacity(steps + 1);
let mut current = start;
positions.push(current);
for _ in 0..steps {
current = self.step(current, rng);
positions.push(current);
}
positions
}
fn histogram(
&self,
start: i32,
steps: usize,
trials: usize,
rng: &mut impl Rng,
) -> HashMap<i32, usize> {
let mut hist = HashMap::new();
for _ in 0..trials {
let positions = self.simulate(start, steps, rng);
let final_pos = *positions.last().unwrap();
*hist.entry(final_pos).or_insert(0) += 1;
}
hist
}
}
fn print_histogram(hist: &HashMap<i32, usize>, min: i32, max: i32, trials: usize) {
let max_count = hist.values().copied().max().unwrap_or(1);
let bar_width = 40;
for pos in min..=max {
let count = hist.get(&pos).copied().unwrap_or(0);
let bar_len = count * bar_width / max_count;
let bar: String = "#".repeat(bar_len);
println!("{:4}: {:40} ({:.1}%)", pos, bar, 100.0 * count as f64 / trials as f64);
}
}
fn main() {
let mut rng = SmallRng::seed_from_u64(42);
println!("=== Symmetric walk (p = 0.5) ===");
let walk = RandomWalk::new(-5, 5, 0.5);
let hist = walk.histogram(0, 200, 10_000, &mut rng);
print_histogram(&hist, -5, 5, 10_000);
println!();
println!("=== Asymmetric walk (p = 0.7) ===");
let walk = RandomWalk::new(-5, 5, 0.7);
let hist = walk.histogram(0, 200, 10_000, &mut rng);
print_histogram(&hist, -5, 5, 10_000);
}
Expected output structure (exact percentages vary by seed):
=== Symmetric walk (p = 0.5) ===
-5: ######################################## (9.3%)
-4: ##################################### (8.6%)
-3: ######################################## (9.4%)
-2: ##################################### (8.7%)
-1: ##################################### (8.7%)
0: #################################### (8.5%)
1: ##################################### (8.8%)
2: ##################################### (8.7%)
3: ##################################### (8.9%)
4: #################################### (8.4%)
5: ######################################### (9.6%)
=== Asymmetric walk (p = 0.7) ===
-5: (0.0%)
-4: (0.0%)
-3: (0.1%)
-2: (0.2%)
-1: (0.4%)
0: (0.8%)
1: # (1.9%)
2: ### (4.5%)
3: ####### (10.5%)
4: ################# (24.5%)
5: ######################################## (57.1%)
The symmetric walk converges to a nearly flat distribution: each position visited roughly 1/11 ≈ 9.1% of the time. The asymmetric walk piles up at the right boundary, with positions +3 … +5 capturing the bulk of the probability mass.
Part 3 - Text Generation
6. Text Generation with Markov Chains
Text can be modeled as a Markov chain where each state is a word (or sequence of words) and transitions represent which words tend to follow. This section explains the bigram model, why it produces surprisingly coherent short sequences, and what its limitations reveal about the relationship between statistical models and language. No code is written here; it prepares you for Exercises 3 and 4.
Words as states. To model text as a Markov chain, treat each word as a state and the act of writing the next word as a transition. Given the word "cat," what word comes next? A Markov model answers that question by consulting statistics drawn from a training corpus: it scans every occurrence of "cat" and records which words followed it, then samples from that empirical distribution. The result is a sequence of words generated one step at a time, each word chosen probabilistically based only on the current word, not on the full sentence or paragraph that came before. This is the Markov property applied to language.
Bigrams and transition tables. A bigram is an ordered pair of adjacent words. To build a bigram model, scan the corpus left to right and record every consecutive word pair. Count how many times each pair appears, then for each word w express those counts as a probability distribution over successor words. This distribution forms one row of the transition table: a lookup from every word to a weighted list of what can follow it. The table is learned entirely from data: no grammar rules, no meaning, just co-occurrence statistics.
Worked example. Consider this two-sentence corpus:
"the cat sat on the mat. the cat sat on the hat."
Scanning the corpus (treating each sentence as a word sequence and ignoring the periods) yields the following bigrams and their counts. Dividing each count by the row total gives the transition probability:
| Current word | Next word | Count | Probability |
|---|---|---|---|
| the | cat | 2 | 0.50 |
| the | mat | 1 | 0.25 |
| the | hat | 1 | 0.25 |
| cat | sat | 2 | 1.00 |
| sat | on | 2 | 1.00 |
| on | the | 2 | 1.00 |
Notice that "cat," "sat," and "on" each have only one possible successor: the small corpus left them no choice. Only "the" branches: half the time it is followed by "cat," a quarter of the time by "mat," and a quarter by "hat." Starting from "the," one plausible generated sequence is: the → cat → sat → on → the → hat, and then "hat" would be an end-of-sentence state.
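The table above can be reproduced mechanically. A sketch that counts bigrams per sentence, splitting on '.' so no pair spans a sentence boundary (the function name is illustrative):

```rust
use std::collections::HashMap;

// Sketch: rebuild the worked-example table by counting bigrams
// within each sentence (split on '.' first, then on whitespace).
fn bigram_counts(corpus: &str) -> HashMap<(String, String), usize> {
    let mut counts = HashMap::new();
    for sentence in corpus.split('.') {
        let words: Vec<&str> = sentence.split_whitespace().collect();
        for pair in words.windows(2) {
            *counts
                .entry((pair[0].to_string(), pair[1].to_string()))
                .or_insert(0) += 1;
        }
    }
    counts
}

fn main() {
    let corpus = "the cat sat on the mat. the cat sat on the hat.";
    let counts = bigram_counts(corpus);
    let the_cat = counts[&("the".to_string(), "cat".to_string())];
    let the_total: usize = counts
        .iter()
        .filter(|((w, _), _)| w.as_str() == "the")
        .map(|(_, c)| c)
        .sum();
    // "the" -> "cat" occurred 2 of 4 times: probability 0.50.
    println!("P(cat | the) = {:.2}", the_cat as f64 / the_total as f64);
}
```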
Why the text sounds strange. Short stretches of output are surprisingly readable: every adjacent pair the model produces appeared in the training data, so each local transition is both grammatically and semantically plausible. The problem surfaces over longer spans. Because the model has no memory beyond the current word, it cannot maintain a topic, complete a thought, or avoid contradictions introduced a few sentences back. The result resembles text written by someone who half-knows the language: each individual step looks right, but the destination keeps shifting without purpose.
Order-n chains. A bigram model is an order-1 Markov chain: the next state depends on exactly one previous word. An order-2 model (a trigram model) conditions on the last two words; order n uses the last n words as context. Increasing n brings sharply improved local coherence (at n = 3 or 4 the output begins to reproduce full phrases from the corpus verbatim) but at the cost of novelty. A high-order model has likely seen most of its contexts only once, so it often has no real choice but to copy the source text rather than recombine it. The right tradeoff depends on corpus size: larger corpora can support higher n without the model simply memorising its training data.
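Structurally, the only change for order 2 is the key type: the context becomes a pair of words, built with windows(3) instead of windows(2). A minimal sketch of the idea (Exercise 4 develops this properly; names here are illustrative):

```rust
use std::collections::HashMap;

// Sketch of the order-2 idea: the state is the last TWO words,
// so the table is keyed by a pair instead of a single word.
fn trigram_table(corpus: &str) -> HashMap<(String, String), Vec<String>> {
    let words: Vec<String> = corpus.split_whitespace().map(str::to_string).collect();
    let mut table: HashMap<(String, String), Vec<String>> = HashMap::new();
    for w in words.windows(3) {
        // (w[0], w[1]) is the context; w[2] is an observed successor.
        table
            .entry((w[0].clone(), w[1].clone()))
            .or_default()
            .push(w[2].clone());
    }
    table
}

fn main() {
    let table = trigram_table("the cat sat on the mat the cat ran");
    // The context ("the", "cat") has two observed successors.
    println!("{:?}", table[&("the".to_string(), "cat".to_string())]); // ["sat", "ran"]
}
```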
7. Exercise 3 - Bigram Text Generator
Goal: Build a Markov chain over words from an input text. Each state is a single word; transitions are learned from the corpus. Generate novel word sequences of a given length from a chosen seed word.
Setup
You can extend the random-walk project from Exercise 2, or create a fresh binary:
cargo new bigram-text
cd bigram-text
cargo add rand@0.8 --features small_rng
You will also need use std::collections::HashMap; at the top of src/main.rs.
Starter Code
Replace the contents of src/main.rs with the following skeleton. Do not change the struct layout or function signatures; your task is to fill in the todo!() bodies and write main.
use rand::Rng;
use std::collections::HashMap;
struct BigramModel {
/// transitions[word] = list of (next_word, weight) pairs
transitions: HashMap<String, Vec<(String, usize)>>,
}
impl BigramModel {
fn train(corpus: &str) -> Self { todo!() }
fn generate(&self, seed: &str, length: usize, rng: &mut impl Rng) -> Vec<String> { todo!() }
}
fn main() { todo!() }
Step 1 โ Tokenize the corpus
Inside train, split the corpus into lowercase words:
let words: Vec<String> = corpus
    .split_whitespace()
    .map(|s| s.to_lowercase())
    .collect();
split_whitespace handles any run of spaces, newlines, or tabs. Lowercasing ensures "The" and "the" are treated as the same state. For this exercise we leave punctuation attached to words (so "sat." and "sat" are distinct) to keep the code simple; stripping it is a good stretch goal.
Step 2 โ Build the transition table
Iterate every consecutive word pair using windows(2). The first word is the current state; the second is the successor to record:
let mut transitions: HashMap<String, Vec<(String, usize)>> = HashMap::new();
for window in words.windows(2) {
    let current = window[0].clone();
    let next = window[1].clone();
    let entry = transitions.entry(current).or_default();
    if let Some(pair) = entry.iter_mut().find(|(w, _)| w == &next) {
        pair.1 += 1;
    } else {
        entry.push((next, 1));
    }
}
For each word, build a list of (successor_word, count) pairs. If the successor already appears in the list, increment its count; otherwise push a new entry with count 1. After scanning the whole corpus, every word maps to a weighted list of what can follow it โ exactly one row of the bigram transition table.
Return BigramModel { transitions }.
Step 3 โ Implement weighted random sampling
generate must sample the next word proportionally to its observed count in the successor list. Add a helper function outside impl BigramModel that performs integer weighted sampling, which is faster and numerically exact compared to floating-point accumulation:
fn sample_weighted<'a>(choices: &'a [(String, usize)], rng: &mut impl Rng) -> &'a str {
    let total: usize = choices.iter().map(|(_, w)| w).sum();
    let mut r = rng.gen_range(0..total);
    for (word, weight) in choices {
        if r < *weight {
            return word;
        }
        r -= weight;
    }
    &choices.last().unwrap().0 // fallback: satisfies the compiler
}
The r -= weight trick avoids floating-point comparisons: draw a random index in [0, total), then walk the list subtracting each bucket's weight until the remaining value lands inside the current entry. This is the discrete-distribution equivalent of the cumulative-probability walk used in Exercise 1, rewritten in integer arithmetic.
Step 4 — Implement BigramModel::generate
generate starts from a seed word and appends one word at a time:
- Push the seed onto `output` and set `current = seed.to_string()`.
- Loop up to `length` times: look up `current` in `self.transitions` and call `sample_weighted` on the result.
- Push the sampled word onto `output` and advance `current` to it.
- If `current` is not in the transition table (a dead end: the word has no recorded successors because it only ever appears as the last token of the corpus), stop early with `break`.
#![allow(unused)]
fn main() {
let mut output = vec![seed.to_string()];
let mut current = seed.to_string();
for _ in 0..length {
match self.transitions.get(&current) {
None => break,
Some(choices) => {
let next = sample_weighted(choices, rng).to_string();
output.push(next.clone());
current = next;
}
}
}
output
}
Step 5 — Run the generator
In main, train on a short corpus and generate several sequences from different seed words. Use a seeded RNG for reproducibility (add use rand::{SeedableRng, rngs::SmallRng};):
const CORPUS: &str =
"alice was beginning to get very tired of sitting by her sister on the \
bank and of having nothing to do once or twice she had peeped into the \
book her sister was reading but it had no pictures or conversations in \
it and what is the use of a book thought alice without pictures or \
conversations alice was beginning to get very tired of sitting";
fn main() {
let mut rng = SmallRng::seed_from_u64(42);
let model = BigramModel::train(CORPUS);
for seed in &["alice", "the", "book"] {
let words = model.generate(seed, 15, &mut rng);
println!("{}", words.join(" "));
}
}
Expected output structure (exact words vary by seed):
alice was beginning to get very tired of sitting by her sister was beginning
the use of a book her sister on the bank and of sitting by her sister
book thought alice without pictures or conversations alice was beginning to get
Observe that every adjacent pair in the output appeared in the training corpus — each local transition is faithful to the source text. But over longer spans the topic shifts erratically, because the model has no memory beyond the immediately preceding word.
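This faithfulness property can be checked mechanically rather than by eye. A stdlib-only sketch (the helper name `pairs_are_faithful` is ours, not part of the exercise code):

```rust
use std::collections::HashSet;

/// Returns true if every adjacent word pair in `output` also occurs
/// as an adjacent pair somewhere in `corpus` (case-insensitive).
fn pairs_are_faithful(corpus: &str, output: &str) -> bool {
    let words: Vec<String> = corpus
        .split_whitespace()
        .map(|s| s.to_lowercase())
        .collect();
    // Every adjacent pair in the corpus, as a set.
    let corpus_pairs: HashSet<(&str, &str)> = words
        .windows(2)
        .map(|w| (w[0].as_str(), w[1].as_str()))
        .collect();
    let out: Vec<String> = output
        .split_whitespace()
        .map(|s| s.to_lowercase())
        .collect();
    out.windows(2)
        .all(|w| corpus_pairs.contains(&(w[0].as_str(), w[1].as_str())))
}

fn main() {
    let corpus = "alice was beginning to get very tired of sitting by her sister";
    // Every adjacent pair here occurs in the corpus above.
    assert!(pairs_are_faithful(corpus, "alice was beginning"));
    // "sitting alice" never occurs as an adjacent pair in the corpus.
    assert!(!pairs_are_faithful(corpus, "tired of sitting alice"));
    println!("faithfulness checks passed");
}
```

Running this against your generator's output is a cheap sanity test that the transition table was built correctly.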
Step 6 — Try a larger corpus
The small Alice snippet produces repetitive output because many words have only one possible successor in a short text. To see genuine branching behaviour, download the first chapter of Alice's Adventures in Wonderland from Project Gutenberg (plain text, freely available) and feed the full text to BigramModel::train. With more data:
- Common words like โtheโ and โandโ branch to dozens of successors.
- Generated sequences stay on-topic for longer stretches before the topic shifts.
- The difference between the single-word context used here and the two-word context in Exercise 4 becomes immediately obvious.
You can embed the text file directly in the binary with Rust's include_str! macro:
#![allow(unused)]
fn main() {
const CORPUS: &str = include_str!("../corpus/alice_ch1.txt");
}
Reference Solution
Show full solution
use rand::{Rng, SeedableRng, rngs::SmallRng};
use std::collections::HashMap;
struct BigramModel {
/// transitions[word] = list of (next_word, weight) pairs
transitions: HashMap<String, Vec<(String, usize)>>,
}
impl BigramModel {
fn train(corpus: &str) -> Self {
let words: Vec<String> = corpus
.split_whitespace()
.map(|s| s.to_lowercase())
.collect();
let mut transitions: HashMap<String, Vec<(String, usize)>> = HashMap::new();
for window in words.windows(2) {
let current = window[0].clone();
let next = window[1].clone();
let entry = transitions.entry(current).or_default();
if let Some(pair) = entry.iter_mut().find(|(w, _)| w == &next) {
pair.1 += 1;
} else {
entry.push((next, 1));
}
}
BigramModel { transitions }
}
fn generate(&self, seed: &str, length: usize, rng: &mut impl Rng) -> Vec<String> {
let mut output = vec![seed.to_string()];
let mut current = seed.to_string();
for _ in 0..length {
match self.transitions.get(&current) {
None => break,
Some(choices) => {
let next = sample_weighted(choices, rng).to_string();
output.push(next.clone());
current = next;
}
}
}
output
}
}
fn sample_weighted<'a>(choices: &'a [(String, usize)], rng: &mut impl Rng) -> &'a str {
let total: usize = choices.iter().map(|(_, w)| w).sum();
let mut r = rng.gen_range(0..total);
for (word, weight) in choices {
if r < *weight {
return word;
}
r -= weight;
}
&choices.last().unwrap().0
}
const CORPUS: &str =
"alice was beginning to get very tired of sitting by her sister on the \
bank and of having nothing to do once or twice she had peeped into the \
book her sister was reading but it had no pictures or conversations in \
it and what is the use of a book thought alice without pictures or \
conversations alice was beginning to get very tired of sitting";
fn main() {
let mut rng = SmallRng::seed_from_u64(42);
let model = BigramModel::train(CORPUS);
for seed in &["alice", "the", "book"] {
let words = model.generate(seed, 15, &mut rng);
println!("{}", words.join(" "));
}
}
8. Exercise 4 — N-gram Generalization
Goal: Generalize the bigram model to an n-gram model where each state is a window of n consecutive words. Compare the output quality for n = 1, 2, 3, and 4 on the same corpus.
Setup
Extend the project from Exercise 3, or create a fresh one:
cargo new ngram-chain
cd ngram-chain
cargo add rand
The NgramModel stores transitions keyed by Vec<String> — a window of n consecutive words. Vec<String> implements Hash and Eq, so it works directly as a HashMap key with no extra wrapping.
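A quick way to convince yourself that Vec<String> needs no wrapper type: count word windows with it as the key directly. The function name `count_windows` is a throwaway for this demo:

```rust
use std::collections::HashMap;

/// Count how often each window of `n` consecutive words occurs.
/// `Vec<String>` implements Hash and Eq, so it is usable as a key directly.
fn count_windows(words: &[&str], n: usize) -> HashMap<Vec<String>, usize> {
    let mut counts: HashMap<Vec<String>, usize> = HashMap::new();
    for w in words.windows(n) {
        let key: Vec<String> = w.iter().map(|s| s.to_string()).collect();
        *counts.entry(key).or_insert(0) += 1;
    }
    counts
}

fn main() {
    let words = ["her", "sister", "was", "her", "sister"];
    let counts = count_windows(&words, 2);
    let key = vec!["her".to_string(), "sister".to_string()];
    // The window ("her", "sister") occurs twice in the word list.
    assert_eq!(counts.get(&key), Some(&2));
    assert_eq!(counts.len(), 3); // three distinct two-word windows
    println!("{} distinct 2-word windows", counts.len());
}
```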
Starter Code
Replace (or extend) src/main.rs with the following skeleton:
use rand::Rng;
use std::collections::HashMap;
struct NgramModel {
n: usize,
transitions: HashMap<Vec<String>, Vec<(String, usize)>>,
}
impl NgramModel {
fn train(corpus: &str, n: usize) -> Self { todo!() }
fn generate(&self, seed: Vec<String>, length: usize, rng: &mut impl Rng) -> Vec<String> { todo!() }
}
fn main() { todo!() }
Step 1 — Tokenize and build the transition table
Inside train, convert the corpus into a flat list of lowercase words:
#![allow(unused)]
fn main() {
let words: Vec<String> = corpus
.split_whitespace()
.map(|s| s.to_lowercase())
.collect();
}
Then iterate over sliding windows of size n + 1. The first n elements form the key (the context); the last element is the successor word to record:
#![allow(unused)]
fn main() {
for window in words.windows(n + 1) {
let key: Vec<String> = window[..n].to_vec();
let next: String = window[n].clone();
// insert (key โ next) into the transition table
}
}
Use HashMap::entry(...).or_default() to get-or-insert the successor list. Scan for an existing entry for next and increment its count, or push (next, 1) if it is new.
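That get-or-insert-and-count step can be written as a standalone helper. The name `record` is ours — the reference solution inlines the same logic:

```rust
use std::collections::HashMap;

/// Record one observed (context -> successor) transition, incrementing
/// the count if this successor was already seen for the context.
fn record(
    transitions: &mut HashMap<Vec<String>, Vec<(String, usize)>>,
    key: Vec<String>,
    next: String,
) {
    let entry = transitions.entry(key).or_default();
    if let Some(pair) = entry.iter_mut().find(|(w, _)| w == &next) {
        pair.1 += 1;
    } else {
        entry.push((next, 1));
    }
}

fn main() {
    let mut t = HashMap::new();
    let ctx = vec!["her".to_string()];
    record(&mut t, ctx.clone(), "sister".to_string());
    record(&mut t, ctx.clone(), "sister".to_string());
    record(&mut t, ctx.clone(), "book".to_string());
    // "sister" was seen twice after "her", "book" once.
    assert_eq!(t[&ctx], vec![("sister".to_string(), 2), ("book".to_string(), 1)]);
    println!("{:?}", t[&ctx]);
}
```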
Step 2 — Weighted sampling helper
Reuse the same cumulative-weight technique from Exercise 3. The helper below accepts any slice of (word, count) pairs and returns a randomly chosen word with probability proportional to its count:
#![allow(unused)]
fn main() {
fn sample<'a>(choices: &'a [(String, usize)], rng: &mut impl Rng) -> &'a str {
let total: usize = choices.iter().map(|(_, w)| w).sum();
let mut r = rng.gen_range(0..total);
for (word, weight) in choices {
if r < *weight {
return word;
}
r -= weight;
}
&choices.last().unwrap().0
}
}
Step 3 — Implement NgramModel::generate
generate is called with a seed — a Vec<String> of exactly n words. It extends that seed word by word, up to length additional words, using a sliding window to track the current context:
- Copy the seed into `output` and into a `VecDeque<String>` called `window`.
- Each step: collect `window` into a `Vec<String>` to use as the lookup key.
- If the key is missing from `self.transitions`, stop early — the chain has reached a dead end.
- Sample the next word, push it onto `output`, pop the oldest word from `window`, and push the new word.
#![allow(unused)]
fn main() {
use std::collections::VecDeque;
// inside generate:
let mut window: VecDeque<String> = VecDeque::from(seed.clone());
let mut output = seed;
for _ in 0..length {
let key: Vec<String> = window.iter().cloned().collect();
match self.transitions.get(&key) {
None => break,
Some(choices) => {
let next = sample(choices, rng).to_string();
output.push(next.clone());
window.pop_front();
window.push_back(next);
}
}
}
output
}
Step 4 — Compare n = 1, 2, 3, 4
In main, train one model per value of n and generate 50 additional words from each. Use a fixed RNG seed for reproducibility. The short Alice corpus below is enough to observe the trend; swap in a larger public-domain text (e.g., the first chapter of Alice's Adventures in Wonderland from Project Gutenberg) for more interesting output.
const CORPUS: &str =
"alice was beginning to get very tired of sitting by her sister on the \
bank and of having nothing to do once or twice she had peeped into the \
book her sister was reading but it had no pictures or conversations in \
it and what is the use of a book thought alice without pictures or \
conversations alice was beginning to get very tired of sitting";
fn main() {
use rand::{SeedableRng, rngs::SmallRng};
for n in 1..=4 {
let model = NgramModel::train(CORPUS, n);
let mut rng = SmallRng::seed_from_u64(42);
// seed = the first n words of the corpus
let seed: Vec<String> = CORPUS
.split_whitespace()
.take(n)
.map(|s| s.to_lowercase())
.collect();
let words = model.generate(seed, 50, &mut rng);
println!("n={}: {}", n, words.join(" "));
println!();
}
}
Expected observations:
- n = 1: one word of context — exactly the bigram model from Exercise 3. Individual transitions feel plausible, but the topic shifts erratically over longer runs.
- n = 2: two words of context; longer coherent stretches emerge, and you will start to recognise verbatim phrases from the corpus.
- n = 3: on a small corpus, most three-word contexts have a single observed successor, so the output follows the source text closely.
- n = 4: most four-word contexts appear only once, leaving the model no real choice but to reproduce the training text nearly verbatim. Try a larger corpus to see n = 4 produce novel output.
Step 5 — Memorisation vs novelty
The pattern above is the central tension in all n-gram language models:
- Small n: short context → many plausible continuations → high novelty, low coherence.
- Large n: long context → typically a unique continuation → low novelty, high local fidelity to the corpus.
The corpus size determines the crossover point. For a paragraph-sized text, n = 2 is usually the maximum useful order. For a novel-length corpus, n = 4 or 5 can produce readable, novel output without simply transcribing the source.
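The crossover can be measured instead of eyeballed: compute the fraction of the generated text's (n+1)-word windows that occur verbatim in the corpus, where 1.0 means pure memorisation at that order. A stdlib sketch (`memorisation_rate` is our name for this hypothetical metric):

```rust
use std::collections::HashSet;

/// Fraction of the (n+1)-word windows in `generated` that occur verbatim
/// in `corpus`. 1.0 means every local continuation was copied from the source.
fn memorisation_rate(corpus: &[&str], generated: &[&str], n: usize) -> f64 {
    let corpus_grams: HashSet<&[&str]> = corpus.windows(n + 1).collect();
    if generated.len() < n + 1 {
        return 0.0;
    }
    let hits = generated
        .windows(n + 1)
        .filter(|w| corpus_grams.contains(*w))
        .count();
    hits as f64 / (generated.len() - n) as f64
}

fn main() {
    let corpus = ["a", "b", "c", "a", "b", "d"];
    // Exact copy of the corpus: every trigram window is a hit.
    let copied = ["a", "b", "c", "a", "b", "d"];
    assert_eq!(memorisation_rate(&corpus, &copied, 2), 1.0);
    // "b c d" never occurs as a trigram in the corpus.
    let novel = ["b", "c", "d"];
    assert_eq!(memorisation_rate(&corpus, &novel, 2), 0.0);
    println!("memorisation checks passed");
}
```

Plotting this rate against n for a fixed corpus makes the crossover point visible directly.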
Stretch Goal — Character-level n-grams
Swap words for individual characters: tokenize with .chars() instead of .split_whitespace(). The model and sampling logic are unchanged — only the “token” definition shifts from words to characters:
#![allow(unused)]
fn main() {
// Character-level tokenization for train:
let chars: Vec<String> = corpus.chars().map(|c| c.to_string()).collect();
// Replace `words` with `chars` and proceed identically.
}
Generate 200 characters at n = 3, 5, and 8. You will see the model progress from random letter sequences, to plausible letter clusters and syllables, to recognisable words and phrases as n grows. This also demonstrates that the n-gram approach is domain-agnostic: the same code works for words, characters, DNA bases, MIDI note sequences, or any discrete token stream.
Reference Solution
Show full solution
use rand::{Rng, SeedableRng, rngs::SmallRng};
use std::collections::{HashMap, VecDeque};
struct NgramModel {
n: usize,
transitions: HashMap<Vec<String>, Vec<(String, usize)>>,
}
impl NgramModel {
fn train(corpus: &str, n: usize) -> Self {
let words: Vec<String> = corpus
.split_whitespace()
.map(|s| s.to_lowercase())
.collect();
let mut transitions: HashMap<Vec<String>, Vec<(String, usize)>> = HashMap::new();
for window in words.windows(n + 1) {
let key: Vec<String> = window[..n].to_vec();
let next = window[n].clone();
let entry = transitions.entry(key).or_default();
if let Some(pair) = entry.iter_mut().find(|(w, _)| w == &next) {
pair.1 += 1;
} else {
entry.push((next, 1));
}
}
NgramModel { n, transitions }
}
fn generate(&self, seed: Vec<String>, length: usize, rng: &mut impl Rng) -> Vec<String> {
assert_eq!(seed.len(), self.n, "seed must have exactly n words");
let mut window: VecDeque<String> = VecDeque::from(seed.clone());
let mut output = seed;
for _ in 0..length {
let key: Vec<String> = window.iter().cloned().collect();
match self.transitions.get(&key) {
None => break,
Some(choices) => {
let next = sample(choices, rng).to_string();
output.push(next.clone());
window.pop_front();
window.push_back(next);
}
}
}
output
}
}
fn sample<'a>(choices: &'a [(String, usize)], rng: &mut impl Rng) -> &'a str {
let total: usize = choices.iter().map(|(_, w)| w).sum();
let mut r = rng.gen_range(0..total);
for (word, weight) in choices {
if r < *weight {
return word;
}
r -= weight;
}
&choices.last().unwrap().0
}
const CORPUS: &str =
"alice was beginning to get very tired of sitting by her sister on the \
bank and of having nothing to do once or twice she had peeped into the \
book her sister was reading but it had no pictures or conversations in \
it and what is the use of a book thought alice without pictures or \
conversations alice was beginning to get very tired of sitting";
fn main() {
for n in 1..=4 {
let model = NgramModel::train(CORPUS, n);
let mut rng = SmallRng::seed_from_u64(42);
let seed: Vec<String> = CORPUS
.split_whitespace()
.take(n)
.map(|s| s.to_lowercase())
.collect();
let words = model.generate(seed, 50, &mut rng);
println!("n={}: {}", n, words.join(" "));
println!();
}
}
Part 4 — Deeper Concepts
9. Stationary Distributions
A stationary distribution π is a probability distribution over states that is unchanged by one step of the chain: πP = π. This section covers how to find stationary distributions analytically (for small chains) and via power iteration, and explains when they exist and are unique — introducing the concepts of irreducibility and aperiodicity.
The stationary equation. Write the state distribution as a row vector π = [π₀, π₁, …, π_{n−1}]. Then π is stationary if it satisfies the following simultaneous conditions:
πP = π (fixed-point equation)
Σᵢ πᵢ = 1 (normalisation)
πᵢ ≥ 0 for all i
The fixed-point equation says that right-multiplying π by the transition matrix returns exactly π: one step of the chain leaves the distribution unchanged. In terms of individual entries, the condition expands to:
πⱼ = Σᵢ πᵢ · P[i][j] for every j
The probability mass flowing into state j from every state i โ weighted by how likely the chain is to be in each i โ exactly equals the current probability already in state j. No state is gaining or losing probability over time; the distribution is in balance.
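This entry-wise balance can be verified numerically for the weather chain, using the 2/3 Sunny, 1/3 Rainy split observed in the Exercise 1 simulations; a minimal stdlib sketch:

```rust
/// Check the entry-wise balance condition pi_j = sum_i pi_i * P[i][j]
/// for a candidate distribution, up to floating-point tolerance.
fn is_stationary(p: &[[f64; 2]; 2], pi: &[f64; 2], tol: f64) -> bool {
    (0..2).all(|j| {
        let inflow: f64 = (0..2).map(|i| pi[i] * p[i][j]).sum();
        (inflow - pi[j]).abs() < tol
    })
}

fn main() {
    let p = [[0.8, 0.2], [0.4, 0.6]];
    // Candidate from the Exercise 1 simulations: 2/3 Sunny, 1/3 Rainy.
    assert!(is_stationary(&p, &[2.0 / 3.0, 1.0 / 3.0], 1e-12));
    // The uniform distribution is not in balance for this chain.
    assert!(!is_stationary(&p, &[0.5, 0.5], 1e-12));
    println!("balance condition verified");
}
```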
Long-run frequency interpretation. The simulations in Exercise 1 showed this balance in practice: after thousands of steps, the fraction of time the chain occupied each state stabilised near 2/3 Sunny and 1/3 Rainy, regardless of the starting state. This is the ergodic theorem for Markov chains: under appropriate conditions, the time-average frequency of visiting state i converges to πᵢ almost surely. The stationary distribution is not merely a mathematical fixed point — it is the long-run proportion of time the chain spends in each state, and it is precisely what your simulations were converging toward.
Irreducibility. A chain is irreducible if every state is reachable from every other state in some finite number of steps. Formally, for every pair (i, j) there exists k ≥ 1 with P^k[i][j] > 0. An irreducible chain cannot get permanently trapped in a proper subset of states; all states form a single communicating class. A chain with an absorbing state (one where P[i][i] = 1 and the chain can enter but never leave) is not irreducible. Irreducibility is what rules out a chain splitting into separate “islands”, each with its own local stationary distribution.
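Because irreducibility depends only on which transitions are possible, not on their exact probabilities, it can be checked with a depth-first search over the positive-probability edges; a sketch for small dense matrices:

```rust
/// True if every state can reach every other state through edges
/// with positive transition probability (single communicating class).
fn is_irreducible(p: &[Vec<f64>]) -> bool {
    let n = p.len();
    (0..n).all(|start| {
        // Depth-first search over positive-probability edges from `start`.
        let mut seen = vec![false; n];
        let mut stack = vec![start];
        seen[start] = true;
        while let Some(i) = stack.pop() {
            for j in 0..n {
                if p[i][j] > 0.0 && !seen[j] {
                    seen[j] = true;
                    stack.push(j);
                }
            }
        }
        seen.iter().all(|&v| v)
    })
}

fn main() {
    let weather = vec![vec![0.8, 0.2], vec![0.4, 0.6]];
    assert!(is_irreducible(&weather));
    // State 1 is absorbing: once entered, the chain never leaves,
    // so state 0 is unreachable from state 1.
    let absorbing = vec![vec![0.5, 0.5], vec![0.0, 1.0]];
    assert!(!is_irreducible(&absorbing));
    println!("irreducibility checks passed");
}
```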
Aperiodicity. A state has period d if returns to that state can only happen in a number of steps that is a multiple of d. Formally, d = gcd{k ≥ 1 : P^k[i][i] > 0}. A state is aperiodic if d = 1, and a chain is aperiodic if every state is. An aperiodic chain does not oscillate on a fixed cycle — there is no rhythm forcing the chain to return to state i only every 2 or every 5 steps. A self-loop (P[i][i] > 0) immediately makes state i aperiodic, because the chain can return after 1 step, so the gcd collapses to 1. A finite, irreducible, aperiodic chain is called ergodic, and such a chain is guaranteed to have exactly one stationary distribution. Furthermore, starting from any initial distribution π₀, the sequence π₀ P^k converges to π as k → ∞. The weather chain is both irreducible (Sunny → Rainy and Rainy → Sunny are both reachable in one step) and aperiodic (both states have positive self-loop probabilities: P[0][0] = 0.8 and P[1][1] = 0.6).
Analytical solution for the weather chain. To solve πP = π for the 2-state chain, write out the equation component-by-component. With π = [π₀, π₁] and the weather transition matrix:
πP = π expands to:
π₀ · P[0][0] + π₁ · P[1][0] = π₀ → 0.8·π₀ + 0.4·π₁ = π₀
π₀ · P[0][1] + π₁ · P[1][1] = π₁ → 0.2·π₀ + 0.6·π₁ = π₁
Both equations carry the same information (one follows from the other because the rows of P sum to 1 and the probabilities sum to 1), so use only the first and add the normalisation constraint:
0.8·π₀ + 0.4·π₁ = π₀
0.4·π₁ = 0.2·π₀
π₀ = 2·π₁
Normalisation: π₀ + π₁ = 1
2·π₁ + π₁ = 1
π₁ = 1/3 ≈ 0.333
π₀ = 2/3 ≈ 0.667
The unique stationary distribution is π ≈ [0.667, 0.333]: about two-thirds of all days are Sunny and one-third are Rainy in the long run. This matches the 66.7 % / 33.3 % figures from the Exercise 1 simulation.
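The same elimination works for any 2-state chain: writing p = P[0][1] and q = P[1][0] with p + q > 0, the steps above yield π = [q/(p+q), p/(p+q)]. This closed form is a standard consequence of the derivation rather than something stated in the text; a quick check:

```rust
/// Stationary distribution of a general 2-state chain, where
/// p = P[0][1] and q = P[1][0]; requires p + q > 0.
fn two_state_stationary(p: f64, q: f64) -> [f64; 2] {
    [q / (p + q), p / (p + q)]
}

fn main() {
    // Weather chain: p = 0.2 (Sunny -> Rainy), q = 0.4 (Rainy -> Sunny).
    let pi = two_state_stationary(0.2, 0.4);
    assert!((pi[0] - 2.0 / 3.0).abs() < 1e-12);
    assert!((pi[1] - 1.0 / 3.0).abs() < 1e-12);
    println!("pi = [{:.4}, {:.4}]", pi[0], pi[1]);
}
```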
Power iteration. For chains with many states, setting up and solving the linear system πP = π directly is impractical. Power iteration is a standard numerical alternative: start from any distribution and repeatedly apply P until successive distributions are indistinguishable. Convergence to the unique π is guaranteed for ergodic chains.
// Pseudocode: power iteration for stationary distribution
let π = uniform distribution over all n states
loop:
    let π_next = π · P              // one step of the chain
    if max |π_next[i] - π[i]| < tolerance:
        break
    π = π_next
// π now approximates the stationary distribution
A concrete Rust sketch for the 2-state weather chain:
fn power_iteration(p: [[f64; 2]; 2], tolerance: f64) -> [f64; 2] {
let mut pi = [0.5_f64, 0.5];
loop {
let pi_next = [
pi[0] * p[0][0] + pi[1] * p[1][0],
pi[0] * p[0][1] + pi[1] * p[1][1],
];
let diff = (pi_next[0] - pi[0]).abs().max((pi_next[1] - pi[1]).abs());
pi = pi_next;
if diff < tolerance {
break;
}
}
pi
}
fn main() {
let p = [[0.8, 0.2], [0.4, 0.6]];
let pi = power_iteration(p, 1e-9);
println!("π ≈ [{:.6}, {:.6}]", pi[0], pi[1]);
// Output: π ≈ [0.666667, 0.333333]
}
Running this for the weather chain converges in about 20 iterations and agrees with the analytical result. For an n-state chain, each iteration costs O(n²), and the number of iterations needed scales inversely with the spectral gap — the difference between the two largest eigenvalue moduli of P. A large spectral gap means fast convergence; a gap close to zero means the chain mixes slowly and power iteration requires many steps.
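The 2-state implementation generalises directly to a dense n-state matrix; a sketch, assuming the chain is ergodic so convergence is guaranteed:

```rust
/// Power iteration for the stationary distribution of an n-state chain.
/// Assumes the chain is ergodic, so the iteration converges.
fn stationary(p: &[Vec<f64>], tolerance: f64) -> Vec<f64> {
    let n = p.len();
    let mut pi = vec![1.0 / n as f64; n]; // start uniform
    loop {
        // One step of the chain: pi_next[j] = sum_i pi[i] * P[i][j].
        let mut pi_next = vec![0.0; n];
        for i in 0..n {
            for j in 0..n {
                pi_next[j] += pi[i] * p[i][j];
            }
        }
        let diff = pi
            .iter()
            .zip(&pi_next)
            .map(|(a, b)| (a - b).abs())
            .fold(0.0_f64, f64::max);
        pi = pi_next;
        if diff < tolerance {
            return pi;
        }
    }
}

fn main() {
    let p = vec![vec![0.8, 0.2], vec![0.4, 0.6]];
    let pi = stationary(&p, 1e-10);
    assert!((pi[0] - 2.0 / 3.0).abs() < 1e-6);
    assert!((pi[1] - 1.0 / 3.0).abs() < 1e-6);
    println!("pi = {:?}", pi);
}
```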
10. Applications and Further Reading
Markov chains appear throughout computer science and mathematics: PageRank, MCMC sampling, hidden Markov models, reinforcement learning, and more. This section surveys these applications at a high level and points to books, papers, and courses for learners who want to go deeper on any thread.
Application Survey
PageRank. Google's original ranking algorithm modeled the web as a Markov chain: each page is a state, and each hyperlink is a transition, with the source page's probability split evenly across its outgoing links. A small “teleportation” probability was added so a random surfer occasionally jumps to a uniformly random page rather than following a link, ensuring the chain is irreducible and has a unique stationary distribution. The stationary probability of each page — the fraction of time a random surfer spends there in the long run — becomes its rank. Because the web had billions of pages, computing the stationary distribution via power iteration rather than direct matrix inversion was essential; the same convergence guarantee from Section 9 applies at planetary scale.
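The construction above can be sketched for a toy 3-page web; the damping factor 0.85 and the link graph are illustrative choices, and power iteration from Section 9 would then produce the ranks:

```rust
/// Build the PageRank transition matrix: with probability d the surfer
/// follows a uniformly chosen outgoing link; with probability 1 - d it
/// teleports to a uniformly random page, making the chain irreducible.
/// (Dangling pages with no out-links are not handled in this sketch.)
fn pagerank_matrix(links: &[Vec<usize>], d: f64) -> Vec<Vec<f64>> {
    let n = links.len();
    let mut p = vec![vec![(1.0 - d) / n as f64; n]; n];
    for (i, out) in links.iter().enumerate() {
        for &j in out {
            p[i][j] += d / out.len() as f64;
        }
    }
    p
}

fn main() {
    // Page 0 links to pages 1 and 2; both link back to page 0.
    let links = vec![vec![1, 2], vec![0], vec![0]];
    let p = pagerank_matrix(&links, 0.85);
    // Every row still sums to 1, so p is a valid transition matrix.
    for row in &p {
        assert!((row.iter().sum::<f64>() - 1.0).abs() < 1e-12);
    }
    // Power iteration on p (as in Section 9) now yields each page's rank;
    // page 0, with two in-links, comes out highest.
    println!("{:?}", p);
}
```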
Markov Chain Monte Carlo (MCMC). Bayesian inference often requires integrating over high-dimensional parameter spaces where the posterior distribution has no closed form. MCMC methods solve this by constructing a Markov chain whose stationary distribution is the target posterior, then running the chain long enough that its samples approximate draws from that distribution. The Metropolis-Hastings algorithm is the foundational recipe: propose a move to a new state, accept it with a probability that preserves detailed balance, and reject it otherwise. Variants such as Gibbs sampling, Hamiltonian Monte Carlo, and the No-U-Turn Sampler (NUTS) power nearly every modern probabilistic programming framework, from Stan to PyMC.
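The role of detailed balance can be checked exactly on a small discrete target: build the Metropolis-Hastings kernel analytically and verify π(x)·P(x→y) = π(y)·P(y→x) for every pair of states. The 3-state target below is an arbitrary illustration:

```rust
/// Metropolis-Hastings transition matrix for a discrete target
/// distribution `pi`, with a symmetric proposal that picks one of the
/// other states uniformly at random.
fn mh_kernel(pi: &[f64]) -> Vec<Vec<f64>> {
    let n = pi.len();
    let q = 1.0 / (n - 1) as f64; // symmetric proposal probability
    let mut p = vec![vec![0.0; n]; n];
    for x in 0..n {
        for y in 0..n {
            if x != y {
                // Propose y, accept with probability min(1, pi(y)/pi(x)).
                p[x][y] = q * (pi[y] / pi[x]).min(1.0);
            }
        }
        // Rejected proposals leave the chain where it is.
        p[x][x] = 1.0 - p[x].iter().sum::<f64>();
    }
    p
}

fn main() {
    let pi = [0.2, 0.3, 0.5]; // arbitrary target distribution
    let p = mh_kernel(&pi);
    // Detailed balance pi(x)·P[x][y] = pi(y)·P[y][x] holds for every
    // pair, which is what makes pi the stationary distribution.
    for x in 0..3 {
        for y in 0..3 {
            assert!((pi[x] * p[x][y] - pi[y] * p[y][x]).abs() < 1e-12);
        }
    }
    println!("detailed balance holds");
}
```

In real MCMC the kernel is never materialised as a matrix — the chain is simulated step by step — but for a toy state space the matrix view makes the correctness argument concrete.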
Hidden Markov Models (HMMs). An HMM separates a Markov chain into two layers: a hidden state sequence that evolves according to a transition matrix, and an observation sequence where each hidden state emits an observable symbol with some probability. The key insight is that the true states are never directly seen — only the observations are. HMMs were the dominant approach in speech recognition for decades (phonemes as hidden states, acoustic features as observations) and remain central in bioinformatics for gene prediction and sequence segmentation. The Viterbi algorithm finds the most likely hidden state path for a given observation sequence in time linear in the sequence length; the Baum-Welch algorithm trains HMMs from unlabelled data using expectation-maximisation.
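A minimal Viterbi decoder fits in a few dozen lines; the 2-state rainy/sunny HMM below is a textbook-style toy example of ours, not part of this course's weather chain:

```rust
/// Viterbi decoding: most likely hidden-state path for an observation
/// sequence, given initial, transition, and emission probabilities.
fn viterbi(
    init: &[f64],
    trans: &[Vec<f64>],
    emit: &[Vec<f64>],
    obs: &[usize],
) -> Vec<usize> {
    let n = init.len();
    // delta[i] = probability of the best path ending in state i so far.
    let mut delta: Vec<f64> = (0..n).map(|i| init[i] * emit[i][obs[0]]).collect();
    let mut back: Vec<Vec<usize>> = Vec::new(); // backpointers per time step
    for &o in &obs[1..] {
        let mut next = vec![0.0; n];
        let mut ptrs = vec![0; n];
        for j in 0..n {
            // Best predecessor state for landing in j now.
            let (i_best, p_best) = (0..n)
                .map(|i| (i, delta[i] * trans[i][j]))
                .max_by(|x, y| x.1.partial_cmp(&y.1).unwrap())
                .unwrap();
            next[j] = p_best * emit[j][o];
            ptrs[j] = i_best;
        }
        delta = next;
        back.push(ptrs);
    }
    // Backtrack from the most probable final state.
    let mut state = (0..n)
        .max_by(|&a, &b| delta[a].partial_cmp(&delta[b]).unwrap())
        .unwrap();
    let mut path = vec![state];
    for ptrs in back.iter().rev() {
        state = ptrs[state];
        path.push(state);
    }
    path.reverse();
    path
}

fn main() {
    // Toy HMM: hidden states 0 = Rainy, 1 = Sunny;
    // observations 0 = walk, 1 = shop, 2 = clean.
    let init = [0.6, 0.4];
    let trans = vec![vec![0.7, 0.3], vec![0.4, 0.6]];
    let emit = vec![vec![0.1, 0.4, 0.5], vec![0.6, 0.3, 0.1]];
    let path = viterbi(&init, &trans, &emit, &[0, 1, 2]);
    assert_eq!(path, vec![1, 0, 0]); // Sunny, then Rainy, Rainy
    println!("{:?}", path);
}
```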
Reinforcement Learning. Most reinforcement learning problems are formulated as Markov Decision Processes (MDPs), which augment a Markov chain with a set of actions and a scalar reward signal. At each step the agent chooses an action, the environment transitions to a new state according to a transition distribution that depends on both the current state and the chosen action, and the agent receives a reward. The goal is to find a policy — a rule mapping states to actions — that maximises cumulative discounted reward. Because the next state depends only on the current state and action (not the history), the Markov property is what makes value-function algorithms like Q-learning and policy-gradient methods tractable. Sutton and Barto's textbook (listed below) treats this connection in full rigour.
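The MDP structure can be made concrete with value iteration on a made-up 2-state, 2-action problem; the Bellman backup V(s) = max_a [ r(s,a) + γ·Σ_{s'} P(s'|s,a)·V(s') ] is applied repeatedly until it reaches a fixed point:

```rust
/// Value iteration for a small finite MDP.
/// trans[s][a][s2] = P(s2 | s, a); reward[s][a] = immediate reward.
fn value_iteration(
    trans: &[Vec<Vec<f64>>],
    reward: &[Vec<f64>],
    gamma: f64,
    sweeps: usize,
) -> Vec<f64> {
    let n = trans.len();
    let mut v = vec![0.0; n];
    for _ in 0..sweeps {
        let v_new: Vec<f64> = (0..n)
            .map(|s| {
                // Bellman backup: value of the best action from state s.
                (0..trans[s].len())
                    .map(|a| {
                        let future: f64 =
                            (0..n).map(|s2| trans[s][a][s2] * v[s2]).sum();
                        reward[s][a] + gamma * future
                    })
                    .fold(f64::NEG_INFINITY, f64::max)
            })
            .collect();
        v = v_new;
    }
    v
}

fn main() {
    // Hypothetical 2-state MDP: action 0 stays put (reward 0 in state 0,
    // reward 1 in state 1); action 1 moves to the other state (reward 0).
    let trans = vec![
        vec![vec![1.0, 0.0], vec![0.0, 1.0]], // from state 0: stay / move
        vec![vec![0.0, 1.0], vec![1.0, 0.0]], // from state 1: stay / move
    ];
    let reward = vec![vec![0.0, 0.0], vec![1.0, 0.0]];
    let v = value_iteration(&trans, &reward, 0.9, 500);
    // Optimal: in state 1, stay forever: V(1) = 1/(1 - 0.9) = 10;
    // in state 0, move then stay: V(0) = 0 + 0.9 * 10 = 9.
    assert!((v[1] - 10.0).abs() < 1e-6);
    assert!((v[0] - 9.0).abs() < 1e-6);
    println!("V = {:?}", v);
}
```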
Bioinformatics — Sequence Analysis. DNA and protein sequences are naturally modelled as Markov chains over an alphabet of bases or amino acids. A simple k-th order Markov model assigns probabilities to short subsequences and can distinguish coding regions from non-coding regions: CpG islands in mammalian genomes, for example, have transition probabilities measurably different from the genomic background. Profile HMMs generalise this to align whole families of sequences — each column in a multiple sequence alignment becomes a hidden state with its own emission distribution, allowing robust database search even for distantly related proteins.
Queueing Theory. The classic M/M/1 queue — arrivals according to a Poisson process, exponential service times, a single server — is a continuous-time Markov chain on the non-negative integers, where the state is the number of customers in the system. Its stationary distribution is geometric, giving simple closed-form expressions for average queue length and waiting time. More complex queueing networks (multiple servers, priorities, finite buffers) extend the same framework and are used to size data-centre infrastructure, analyse hospital emergency departments, and design network switches. Continuous-time Markov chains replace the transition matrix with a generator matrix Q whose off-diagonal entries are transition rates rather than probabilities.
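The geometric stationary distribution and the closed-form mean can be checked numerically by truncating the tail; a sketch with utilisation ρ = 0.5:

```rust
/// Stationary probability of finding k customers in an M/M/1 queue
/// with utilisation rho = lambda / mu < 1: pi_k = (1 - rho) * rho^k.
fn mm1_pi(rho: f64, k: u32) -> f64 {
    (1.0 - rho) * rho.powi(k as i32)
}

fn main() {
    let rho = 0.5;
    // The geometric probabilities sum to 1 (truncated far into the tail).
    let total: f64 = (0..200).map(|k| mm1_pi(rho, k)).sum();
    assert!((total - 1.0).abs() < 1e-12);
    // The truncated mean matches the closed form rho / (1 - rho) = 1.
    let mean: f64 = (0..200).map(|k| k as f64 * mm1_pi(rho, k)).sum();
    assert!((mean - rho / (1.0 - rho)).abs() < 1e-9);
    println!("mean queue length = {:.6}", mean);
}
```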
Language Models. The n-gram models from Exercises 3 and 4 are finite-order Markov chains over word tokens, and they directly preceded modern neural language models. In the 1990s and 2000s, trigram and 4-gram models with smoothing (Kneser-Ney, Witten-Bell) were the state of the art for machine translation and speech recognition. Neural language models replaced explicit Markov structure with learned representations, but the conceptual scaffolding is the same: predict the next token from a bounded window of context. Understanding the Markov chain view of language — its local coherence, its lack of long-range memory, its reliance on corpus statistics — clarifies both what earlier systems got right and what neural approaches had to learn to transcend.
Further Reading
Books
- Blitzstein & Hwang — Introduction to Probability (2nd ed.). A beautifully written undergraduate probability text with a full chapter on Markov chains. The authors' lecture videos and course materials are freely available; a free PDF of the book is offered on the book's companion site. Start here if your probability background is thin or if you want every concept illustrated with concrete examples before the formalism arrives.
- Norris — Markov Chains. The standard rigorous treatment at the advanced-undergraduate / early-graduate level. Covers discrete- and continuous-time chains, convergence, reversibility, and applications with full proofs. Dense but thorough; worth working through if you intend to read research papers that use Markov chains as a theoretical tool.
- Sutton & Barto — Reinforcement Learning: An Introduction (2nd ed.). The canonical RL textbook, freely available as a PDF from the authors. Chapters 3 and 4 formalise MDPs and dynamic programming using exactly the Markov chain machinery developed in this course. Reading those chapters after completing this course is a natural next step toward understanding how modern game-playing and robotics agents are designed.
Articles
- Metropolis-Hastings algorithm — Wikipedia. A well-maintained article that covers the algorithm statement, acceptance-ratio derivation, intuition for why detailed balance implies the correct stationary distribution, and pseudocode. A good companion to the original 1953 Metropolis et al. paper (five pages, freely available) and the 1970 Hastings generalisation.
Rust Ecosystem Pointers
The exercises in this course used rand for sampling. A few other crates are useful as you build more serious probabilistic programs in Rust:
- `nalgebra` — a comprehensive linear-algebra library covering vectors, dense matrices, and decompositions. Use it to compute P^k via repeated matrix multiplication or to solve the stationary-distribution equations πP = π, Σᵢ πᵢ = 1 as a linear system.
- `petgraph` — graph data structures and algorithms. Markov chains are directed weighted graphs, and `petgraph` lets you represent them as such, run graph-theoretic algorithms (strongly connected components give you irreducible sub-chains), and visualise the structure via its Graphviz export.
- `statrs` — a statistics library providing common probability distributions with their PDFs, CDFs, and samplers. Useful when building emission distributions for HMMs or when you need chi-squared tests to check whether simulated chain frequencies match theoretical stationary probabilities.