Vibed Learning

Welcome to Vibed Learning — a collection of self-guided courses and technical references on programming, computer science, and software engineering.

What This Is

This site is an experiment in generating custom learning resources using large language models. Every course and reference here was created with Claude Code, using Anthropic’s Opus 4.6 and Sonnet 4.6 models. The goal is to explore how AI-assisted authorship can produce educational content that is clear, accurate, and genuinely useful.

The content is a work in progress. New courses are added as topics are explored, and existing material is revised as needed.

Content

Browse the table of contents on the left to find courses. Current topics include:

  • Markov Chains — a self-guided introduction to the math and intuition behind Markov chains
  • Vector Databases — how vector search works and when to use it
  • Git Worktrees — using multiple working trees with a single Git repository
  • Writing a Lisp-to-C Compiler in Rust — building a compiler from scratch

License

All content on this site is released under CC0 (Public Domain). You may use, copy, modify, and distribute it freely, without asking for permission and without attribution.

Contact

Questions, corrections, or suggestions: vibebooks@elijah.run

Markov Chain Self-Guided Course

This document is a self-guided course on Markov chains. It is organized into four parts: conceptual foundations, first Rust implementations, text generation, and deeper theory. Each section is either a reading lesson or a hands-on Rust programming exercise. Sections marked 🚧 are stubs whose full content is tracked in an nbd ticket — follow the ticket ID to find the detailed learning objectives and instructions.


Table of Contents

Part 1 — Foundations

  1. What Is a Markov Chain?
  2. States and Transitions
  3. Transition Probabilities and Matrices

Part 2 — First Implementation

  1. Exercise 1 — Weather Model
  2. Exercise 2 — Simulating a Random Walk

Part 3 — Text Generation

  1. Text Generation with Markov Chains
  2. Exercise 3 — Bigram Text Generator
  3. Exercise 4 — N-gram Generalization

Part 4 — Deeper Concepts

  1. Stationary Distributions
  2. Applications and Further Reading

Part 1 — Foundations

1. What Is a Markov Chain?

A Markov chain is a mathematical model describing a sequence of events where the probability of each event depends only on the state reached in the previous event — not on the full history. This “memoryless” property is called the Markov property. You will learn where Markov chains appear in the real world and develop intuition for why the memoryless property is both a useful simplification and a meaningful assumption.

The core idea: only now matters. Imagine you are tracking today’s weather. Intuitively, you might think yesterday’s weather, the week before, and the entire season all influence what tomorrow will bring. A Markov chain says: forget all of that. Given that you know today’s weather, knowledge of every earlier day adds nothing to your prediction of tomorrow. The present state captures everything relevant from the past. This is the Markov property — colloquially “memorylessness” — and it is a surprisingly powerful modelling assumption.

Formally, let X₀, X₁, X₂, … be a sequence of random variables each taking values in some set of states. The sequence is a Markov chain if, for every time step n and every state s:

P(Xₙ₊₁ = s | X₀, X₁, …, Xₙ) = P(Xₙ₊₁ = s | Xₙ)

The left-hand side conditions on the entire history up to step n. The right-hand side conditions on only the current state Xₙ. The equation says these two quantities are always equal: no matter how you got to the current state, your distribution over the next state is the same.

A worked example — the weather model. Suppose a city has two kinds of days: Sunny and Rainy. You observe that:

  • After a Sunny day, there is an 80 % chance of another Sunny day and a 20 % chance of Rain.
  • After a Rainy day, there is a 40 % chance of Sun and a 60 % chance of Rain.

Under the Markov assumption, these two rules are all you need. If today is Sunny, the chance of rain tomorrow is 20 % — regardless of whether the preceding week was a drought or a monsoon. The model is simple because it deliberately ignores deep history, and it is useful precisely because that history often adds little predictive power once you know the current state.

Where Markov chains appear in the real world. Once you recognise the Markov property you will spot it everywhere:

  • Board games. In Snakes and Ladders the only thing that matters is which square you are on right now. The sequence of rolls that brought you there is irrelevant — your future depends only on your current position.
  • Web surfing (PageRank). Google’s original PageRank algorithm modelled a hypothetical random web surfer who, at each page, clicks a link chosen uniformly at random. Where the surfer goes next depends only on the current page, not the path taken to reach it. The long-run fraction of time spent at each page is the page’s rank.
  • Genetics. In simple population-genetics models, the number of copies of a gene in the current generation determines the distribution of copies in the next generation. The frequencies in prior generations, once summarised in the current count, carry no additional information.
  • Text generation. Given the last word (or last few words) of a sentence, the probability of the next word can be estimated from a corpus. The full sentence history is ignored — only the recent context matters. Sections 6–8 of this course build exactly this kind of model in Rust.

Why the Markov property is a useful assumption. Real systems are almost never perfectly memoryless — yesterday’s weather genuinely does carry a whisper of information beyond today’s. So why use Markov models? Because they strike a remarkable balance between tractability and expressiveness. A model that conditions on the entire history is usually intractable; one that ignores history entirely is too crude. The Markov property is the sweet spot: it allows rigorous mathematical analysis (stationary distributions, convergence theorems, efficient simulation) while still capturing the essential dynamics of many real processes. When the assumption is too crude, you can extend it by enlarging the state space — for instance, tracking the last two days of weather instead of one — and the Markov property holds again at that richer level of description.
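The state-enlargement trick in the last sentence can be sketched concretely. The following is an illustrative fragment, not part of the course exercises: the order-2 state is a pair (yesterday, today), and advancing the chain simply shifts that window.

```rust
// Illustrative sketch (not a course exercise): an order-2 weather model
// regains the Markov property by making the state a pair of days.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Day { Sunny, Rainy }

// The enlarged state is (yesterday, today). Advancing the chain shifts
// the window, so the next pair depends only on the current pair: the
// Markov property holds again at this richer level of description.
fn advance(state: (Day, Day), tomorrow: Day) -> (Day, Day) {
    let (_yesterday, today) = state;
    (today, tomorrow)
}

fn main() {
    let state = (Day::Sunny, Day::Rainy);
    let next = advance(state, Day::Rainy);
    println!("{:?} -> {:?}", state, next); // (Sunny, Rainy) -> (Rainy, Rainy)
}
```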


2. States and Transitions

Every Markov chain consists of a set of states and a rule for how the system moves between them. The set of all possible states is called the state space, commonly written S. In this course the state space is always discrete — it is either finite (e.g., {Sunny, Rainy}) or countably infinite (e.g., the non-negative integers {0, 1, 2, …}, as in a random walk on the number line). Discrete state spaces are by far the most common setting for introductory Markov chain theory and for the kinds of simulations you will build in Rust.

A transition is a single step from one state to another. At each discrete time step the chain occupies some state i ∈ S, then moves to state j ∈ S with probability P(i → j). Two things matter: (1) transitions are directed — going from i to j and going from j to i are distinct transitions with potentially different probabilities; and (2) every transition has an associated probability, and the probabilities of all transitions out of a given state must sum to 1. A transition from i back to i — a self-loop — is perfectly valid and simply means the chain stays put with some nonzero probability.

State-transition diagrams make these rules visual. Draw one node for each state and one labelled arrow for each possible transition; the label is the transition probability. For the two-state weather model from Section 1 the diagram looks like this:

          0.8                          0.6
     ╭─────────╮                  ╭─────────╮
     │         │       0.2        │         │
     ▼         │  ─────────────►  │         ▼
  ╔═══════╗ ───┘                  └─── ╔═══════╗
  ║ SUNNY ║                           ║ RAINY ║
  ╚═══════╝ ───┐                  ┌─── ╚═══════╝
               │       0.4        │
               └──────────────────┘

The self-loops (Sunny → Sunny with 0.8, Rainy → Rainy with 0.6) show that the chain can stay in the same state. The cross-arrows capture the transitions between states. Every row of outgoing arrows sums to 1: from Sunny, 0.8 + 0.2 = 1; from Rainy, 0.4 + 0.6 = 1.

States in a Markov chain differ in how they relate to the chain’s long-run behaviour. Three categories matter most:

  • An absorbing state is one you can never leave: once the chain enters it, it stays there forever. A self-loop with probability 1 is the defining feature. A simple example is a gambler who reaches £0 in a gambling game with no credit — they are stuck.
  • A transient state is one you are not guaranteed to return to. If you leave a transient state there is some positive probability of never coming back. In many models early states are transient — the chain passes through them and moves on.
  • A recurrent state (also called a persistent state) is one you are guaranteed to return to eventually, with probability 1. In the weather model both Sunny and Rainy are recurrent: no matter which state you are in, you will eventually visit both states again and again.

The distinction between transient and recurrent states determines the chain’s long-run behaviour. A chain that only visits transient states will eventually leave them forever; a chain trapped in recurrent states will cycle among them indefinitely. Understanding this classification is the foundation for studying stationary distributions, which Section 9 covers in detail.
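Of the three categories, absorbing states are the easiest to spot mechanically: in a transition matrix they are exactly the rows whose diagonal entry is 1. A small sketch, using a hypothetical helper name not used elsewhere in the course:

```rust
// Sketch: an absorbing state i is one with P[i][i] == 1.0, a self-loop
// the chain can never leave. `absorbing_states` is a hypothetical
// helper, not part of the exercises.
fn absorbing_states(p: &[Vec<f64>]) -> Vec<usize> {
    p.iter()
        .enumerate()
        .filter(|(i, row)| (row[*i] - 1.0).abs() < 1e-12)
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    // A gambler's-ruin-style chain: states 0 and 2 are absorbing,
    // state 1 is transient.
    let p = vec![
        vec![1.0, 0.0, 0.0],
        vec![0.5, 0.0, 0.5],
        vec![0.0, 0.0, 1.0],
    ];
    println!("{:?}", absorbing_states(&p)); // [0, 2]
}
```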


3. Transition Probabilities and Matrices

The rules governing how a Markov chain moves are captured in a transition matrix P, where P[i][j] is the probability of moving from state i to state j in one step. This section covers how to construct P, the constraints it must satisfy (rows sum to 1), and how to use matrix multiplication to compute multi-step probabilities.

Defining the transition matrix. Label the states 0, 1, …, n−1. The transition matrix P is an n × n array where entry P[i][j] gives the probability of moving to state j on the very next step, given that you are currently in state i. Because these are probabilities of mutually exclusive, exhaustive outcomes (from state i you must go somewhere), every row must sum to exactly 1 and every entry must lie between 0 and 1 inclusive. A matrix satisfying these two constraints is called a stochastic matrix (or row-stochastic matrix). Each row is itself a probability distribution over the next state.

The stochastic-matrix constraints, stated precisely. For an n-state chain:

P[i][j] >= 0        for all i, j
sum_j P[i][j] = 1   for every row i

A zero entry means the transition is impossible; a one means it is certain. Columns have no such constraint — column sums need not equal 1.
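These two constraints are easy to check in code. A minimal sketch (`is_stochastic` is a hypothetical helper name; a small tolerance absorbs floating-point rounding):

```rust
// Sketch: verify the stochastic-matrix constraints: every entry is
// non-negative and every row sums to 1 (within floating-point tolerance).
fn is_stochastic(p: &[Vec<f64>]) -> bool {
    p.iter().all(|row| {
        let nonneg = row.iter().all(|&x| x >= 0.0);
        let sum: f64 = row.iter().sum();
        nonneg && (sum - 1.0).abs() < 1e-9
    })
}

fn main() {
    let weather = vec![vec![0.8, 0.2], vec![0.4, 0.6]];
    let broken = vec![vec![0.9, 0.2], vec![0.4, 0.6]]; // row 0 sums to 1.1
    println!("{} {}", is_stochastic(&weather), is_stochastic(&broken)); // true false
}
```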

Multi-step probabilities via matrix multiplication. Suppose you start in state i at time 0. After one step the probability of being in state j is P[i][j]. After two steps you pass through some intermediate state k, so:

P^2[i][j] = sum_k  P[i][k] * P[k][j]

This is exactly the (i, j) entry of P × P = P². In general, the probability of going from state i to state j in exactly k steps is the (i, j) entry of P^k. If you encode your current uncertainty as a row vector π₀ — a probability distribution over all states — then after k steps your updated distribution is:

π_k = π₀ · P^k

Each right-multiplication by P advances the clock one tick and blends probabilities according to the transition rules.

Worked example — a two-state weather chain. Consider a model with two states: Sunny (state 0) and Rainy (state 1). From data:

  • If today is Sunny, tomorrow is Sunny with probability 0.8 and Rainy with probability 0.2.
  • If today is Rainy, tomorrow is Sunny with probability 0.4 and Rainy with probability 0.6.

Writing this as a matrix:

        Sunny  Rainy
Sunny [  0.8    0.2 ]
Rainy [  0.4    0.6 ]

Row 0 sums to 1.0; row 1 sums to 1.0. All entries are non-negative. P is a valid stochastic matrix.

One step. Start with certainty in Sunny: π₀ = [1, 0].

π₁ = π₀ · P = [1, 0] · [[0.8, 0.2], [0.4, 0.6]] = [0.8, 0.2]

Tomorrow: 80 % Sunny, 20 % Rainy.

Two steps. Apply P again:

π₂ = π₁ · P = [0.8, 0.2] · [[0.8, 0.2], [0.4, 0.6]]
             = [0.8*0.8 + 0.2*0.4,  0.8*0.2 + 0.2*0.6]
             = [0.72, 0.28]

Equivalently, compute P² once and read off the row for state 0:

P^2 = [[0.8*0.8 + 0.2*0.4,  0.8*0.2 + 0.2*0.6],
       [0.4*0.8 + 0.6*0.4,  0.4*0.2 + 0.6*0.6]]
    = [[0.72, 0.28],
       [0.56, 0.44]]

P²[0] = [0.72, 0.28] — matching the step-by-step result. Starting from Rainy gives P²[1] = [0.56, 0.44]; the two rows are already noticeably closer to each other than the original [0.8, 0.2] vs [0.4, 0.6]. As k grows, both rows converge toward the same limiting vector — the stationary distribution that Section 9 analyses in depth. The matrix-multiplication perspective makes this convergence precise and computable.
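The computation above can be reproduced in a few lines of plain Rust. This is a sketch, not part of the exercises; repeated squaring of P shows both rows converging toward the same limit:

```rust
// Sketch: multiply stochastic matrices with plain nested loops (no
// linear-algebra crate) and watch the rows of P^k converge.
fn mat_mul(a: &[Vec<f64>], b: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let n = a.len();
    let mut c = vec![vec![0.0; n]; n];
    for i in 0..n {
        for k in 0..n {
            for j in 0..n {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
    c
}

fn main() {
    let p = vec![vec![0.8, 0.2], vec![0.4, 0.6]];
    let p2 = mat_mul(&p, &p);
    println!("{:?}", p2); // ≈ [[0.72, 0.28], [0.56, 0.44]] up to rounding

    // Square repeatedly: P^2 -> P^4 -> P^8 -> P^16. Both rows approach
    // the stationary distribution [2/3, 1/3].
    let mut pk = p2;
    for _ in 0..3 {
        pk = mat_mul(&pk, &pk);
    }
    println!("{:?}", pk);
}
```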


Part 2 — First Implementation

4. Exercise 1 — Weather Model

Goal: Build a two-state Markov chain in Rust that models daily weather as either Sunny or Rainy, driven by a transition matrix, and simulate 30 days of weather.

Setup

Create a new Cargo binary project and add the rand crate:

cargo new weather-chain
cd weather-chain
cargo add rand

Starter Code

Replace the contents of src/main.rs with the following skeleton. Do not change the struct layout or function signatures — your task is to fill in the todo!() bodies and write main.

use rand::Rng;

#[derive(Debug, Clone, Copy, PartialEq)]
enum Weather { Sunny, Rainy }

struct WeatherChain {
    /// transition[current][next] = probability
    transition: [[f64; 2]; 2],
}

impl WeatherChain {
    fn step(&self, current: Weather, rng: &mut impl Rng) -> Weather { todo!() }
    fn simulate(&self, start: Weather, steps: usize, rng: &mut impl Rng) -> Vec<Weather> { todo!() }
}

fn main() { todo!() }

Step 1 — Index conversion

step needs to index into self.transition using the current state, and later convert an index back to a Weather value. Add two helper methods to the Weather enum:

  • fn index(self) -> usize — returns 0 for Sunny, 1 for Rainy
  • fn from_index(i: usize) -> Self — returns Sunny for 0, Rainy for anything else

A simple match expression handles both. These are the only tools you need to bridge the enum and the matrix.

Step 2 — Implement WeatherChain::step

step must sample the next state from the probability row for the current state. The technique is a cumulative-probability walk:

  1. Retrieve the transition row: let row = self.transition[current.index()];
  2. Draw a uniform float in [0, 1): let r: f64 = rng.gen();
  3. Walk the row, accumulating probability. When the running total first exceeds r, return that state.

For the two-state case this reduces to a single comparison — if r < row[0] return Sunny, otherwise return Rainy — but implementing the loop works for any number of states and is worth practising.

Add a fallback Weather::from_index(row.len() - 1) after the loop to satisfy the compiler; floating-point rounding can in rare cases leave the loop without returning.

Step 3 — Implement WeatherChain::simulate

simulate runs the chain for steps transitions, collecting every state visited including the start:

  1. Allocate a Vec with capacity steps + 1.
  2. Push start and set current = start.
  3. Loop steps times: call self.step(current, rng), update current, and push.
  4. Return the Vec.

The returned slice will have length steps + 1 (the initial state plus one state per step).

Step 4 — Run the simulation

In main, create a WeatherChain with the two-state matrix from Section 3:

Sunny [0.8, 0.2]
Rainy [0.4, 0.6]

Seed a repeatable RNG with rand::rngs::SmallRng::seed_from_u64(42) (add use rand::SeedableRng;; depending on your rand version, SmallRng may need the small_rng feature enabled in Cargo.toml). Simulate 30 steps starting from Sunny, print the resulting sequence, and count how many days were sunny vs rainy.

Expected output structure (exact numbers vary by seed):

[Sunny, Sunny, Rainy, Sunny, ...]
Sunny days: 21 (67.7%)
Rainy days: 10 (32.3%)

Step 5 — Compare to the stationary distribution

The stationary distribution π satisfies π = πP, meaning once the chain reaches it, the distribution no longer changes. For this matrix, solve the two equations:

π₀ = 0.8·π₀ + 0.4·π₁
π₁ = 0.2·π₀ + 0.6·π₁
π₀ + π₁ = 1

The first equation simplifies to 0.2·π₀ = 0.4·π₁, giving π₀ = 2·π₁. Substituting into the normalisation constraint: π₀ = 2/3 ≈ 66.7%, π₁ = 1/3 ≈ 33.3%.

Print the stationary percentages alongside your simulated counts. With only 31 data points the match will be rough; re-run with 1 000 or 10 000 steps to see the empirical frequencies converge.
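As a cross-check on the algebra, the stationary distribution can also be found numerically by applying P repeatedly to any starting distribution (power iteration). A minimal standalone sketch:

```rust
// Sketch: iterate pi <- pi * P until it stops changing. For this chain
// the iteration converges to the stationary distribution [2/3, 1/3].
fn step_dist(pi: &[f64; 2], p: &[[f64; 2]; 2]) -> [f64; 2] {
    [
        pi[0] * p[0][0] + pi[1] * p[1][0],
        pi[0] * p[0][1] + pi[1] * p[1][1],
    ]
}

fn main() {
    let p = [[0.8, 0.2], [0.4, 0.6]];
    let mut pi = [1.0, 0.0]; // start from "certainly Sunny"
    for _ in 0..100 {
        pi = step_dist(&pi, &p);
    }
    println!("pi ≈ [{:.4}, {:.4}]", pi[0], pi[1]); // pi ≈ [0.6667, 0.3333]
}
```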

Reference Solution

use rand::{Rng, SeedableRng, rngs::SmallRng};

#[derive(Debug, Clone, Copy, PartialEq)]
enum Weather { Sunny, Rainy }

impl Weather {
    fn index(self) -> usize {
        match self {
            Weather::Sunny => 0,
            Weather::Rainy => 1,
        }
    }

    fn from_index(i: usize) -> Self {
        match i {
            0 => Weather::Sunny,
            _ => Weather::Rainy,
        }
    }
}

struct WeatherChain {
    /// transition[current][next] = probability
    transition: [[f64; 2]; 2],
}

impl WeatherChain {
    fn step(&self, current: Weather, rng: &mut impl Rng) -> Weather {
        let row = self.transition[current.index()];
        let r: f64 = rng.gen();
        let mut cumulative = 0.0;
        for (i, &prob) in row.iter().enumerate() {
            cumulative += prob;
            if r < cumulative {
                return Weather::from_index(i);
            }
        }
        Weather::from_index(row.len() - 1)
    }

    fn simulate(&self, start: Weather, steps: usize, rng: &mut impl Rng) -> Vec<Weather> {
        let mut states = Vec::with_capacity(steps + 1);
        states.push(start);
        let mut current = start;
        for _ in 0..steps {
            current = self.step(current, rng);
            states.push(current);
        }
        states
    }
}

fn main() {
    let mut rng = SmallRng::seed_from_u64(42);
    let chain = WeatherChain {
        transition: [[0.8, 0.2], [0.4, 0.6]],
    };

    let states = chain.simulate(Weather::Sunny, 30, &mut rng);
    println!("{:?}", states);

    let total = states.len() as f64;
    let sunny = states.iter().filter(|&&s| s == Weather::Sunny).count();
    let rainy = states.len() - sunny;

    println!("Sunny days: {} ({:.1}%)", sunny, 100.0 * sunny as f64 / total);
    println!("Rainy days: {} ({:.1}%)", rainy, 100.0 * rainy as f64 / total);
    println!("Stationary: Sunny ≈ 66.7%, Rainy ≈ 33.3%");
}

5. Exercise 2 — Simulating a Random Walk

Goal: Implement a one-dimensional random walk on the integers (states −N … +N) with reflecting boundaries, then measure the empirical distribution of positions after T steps.

A random walk is one of the simplest Markov chains: the state is a single integer position, and at each step the walker moves left or right according to a fixed probability. Adding reflecting boundaries means the walker cannot escape the interval — a step that would leave the interval is clamped to the nearest boundary.

Learning objectives

  • Model a finite integer state space in Rust
  • Implement reflecting (clamped) boundary conditions
  • Aggregate many simulation runs into a histogram
  • Visualise the distribution as an ASCII bar chart
  • Observe how asymmetric step probabilities skew the stationary distribution

Setup

You can extend the weather-chain project from Exercise 1, or create a fresh binary:

cargo new random-walk
cd random-walk
cargo add rand

You will also need use std::collections::HashMap; at the top of src/main.rs.

Starter Code

Replace the contents of src/main.rs with the following skeleton. Do not change the struct layout or function signatures — your task is to fill in the todo!() bodies.

use rand::Rng;
use std::collections::HashMap;

struct RandomWalk {
    min: i32,
    max: i32,
    /// prob_right[i] = probability of stepping right from position `min + i as i32`
    prob_right: Vec<f64>,
}

impl RandomWalk {
    fn step(&self, pos: i32, rng: &mut impl Rng) -> i32 { todo!() }
    fn simulate(&self, start: i32, steps: usize, rng: &mut impl Rng) -> Vec<i32> { todo!() }
    fn histogram(&self, start: i32, steps: usize, trials: usize, rng: &mut impl Rng)
        -> HashMap<i32, usize> { todo!() }
}

fn print_histogram(hist: &HashMap<i32, usize>, min: i32, max: i32, trials: usize) { todo!() }

fn main() { todo!() }

Step 1 — Constructor and index conversion

The prob_right vec is indexed from 0, but positions range from min to max. Add two helpers to RandomWalk:

  • fn new(min: i32, max: i32, p: f64) -> Self — creates a uniform walk where every position has the same step probability p. Fill prob_right with vec![p; (max - min + 1) as usize].
  • fn idx(&self, pos: i32) -> usize — converts a position to a vec index: (pos - self.min) as usize. Use this everywhere you index into prob_right.

Step 2 — Implement RandomWalk::step

step must sample the next position:

  1. Draw a uniform float r in [0, 1): let r: f64 = rng.gen();
  2. Look up the step probability: let p = self.prob_right[self.idx(pos)];
  3. Choose direction: if r < p, move right (pos + 1); otherwise move left (pos - 1)
  4. Apply the reflecting boundary by clamping: moved.clamp(self.min, self.max)
let moved = if r < p { pos + 1 } else { pos - 1 };
moved.clamp(self.min, self.max)

Clamping is the simplest reflecting rule: a step that would leave the interval is silently held at the boundary. An alternative is strict reflection (a step right from max lands at max - 1), but clamping is sufficient here.
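For comparison, the strict-reflection rule just mentioned can be sketched as a standalone helper (hypothetical, not required by the exercise):

```rust
// Sketch of strict reflection: a step that would cross a boundary
// bounces back one cell instead of being held at the boundary.
fn reflect(moved: i32, min: i32, max: i32) -> i32 {
    if moved > max {
        2 * max - moved // e.g. max + 1 reflects to max - 1
    } else if moved < min {
        2 * min - moved
    } else {
        moved
    }
}

fn main() {
    assert_eq!(reflect(6, -5, 5), 4);   // tried to step past +5
    assert_eq!(reflect(-6, -5, 5), -4); // tried to step past -5
    assert_eq!(reflect(3, -5, 5), 3);   // interior moves are unchanged
    println!("ok");
}
```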

Step 3 — Implement RandomWalk::simulate and RandomWalk::histogram

simulate runs the chain for steps transitions and collects every position visited:

  1. Allocate a Vec with capacity steps + 1.
  2. Push start and set current = start.
  3. Loop steps times: call self.step(current, rng), update current, push.
  4. Return the vec.

histogram runs many independent trials and records where each trial ends:

  1. Allocate an empty HashMap<i32, usize>.
  2. Loop trials times: call self.simulate(start, steps, rng), extract *positions.last().unwrap(), and increment its count with *hist.entry(pos).or_insert(0) += 1.
  3. Return the map.

The histogram answers: after steps steps starting from start, what fraction of the time does the walker end at each position?

Step 4 — Print an ASCII bar chart

Write print_histogram to display the distribution. For each position from min to max:

  1. Look up count = hist.get(&pos).copied().unwrap_or(0).
  2. Scale to a bar: bar_len = count * bar_width / max_count (use bar_width = 40 and max_count = hist.values().copied().max().unwrap_or(1)).
  3. Print the position, a bar of # characters, and the percentage.
let bar: String = "#".repeat(bar_len);
println!("{:4}: {:40} ({:.1}%)", pos, bar, 100.0 * count as f64 / trials as f64);

Step 5 — Compare symmetric vs asymmetric walks

In main, run two experiments. Seed a repeatable RNG with SmallRng::seed_from_u64(42) (add use rand::{SeedableRng, rngs::SmallRng};).

Symmetric walk (p = 0.5, range −5 … 5):

let walk = RandomWalk::new(-5, 5, 0.5);
let hist = walk.histogram(0, 200, 10_000, &mut rng);
print_histogram(&hist, -5, 5, 10_000);

Asymmetric walk (p = 0.7, same range):

let walk = RandomWalk::new(-5, 5, 0.7);
let hist = walk.histogram(0, 200, 10_000, &mut rng);
print_histogram(&hist, -5, 5, 10_000);

Observe:

  • Symmetric (p = 0.5): after enough steps the distribution is approximately uniform — each of the 11 positions appears with roughly 9% frequency. This is the stationary distribution for a symmetric walk with clamping boundaries.
  • Asymmetric (p = 0.7): the walker drifts toward the right boundary; positions near +5 accumulate much higher frequencies than those near −5. The stationary distribution is geometric, rising steeply toward max.

Try lowering steps from 200 to 10 and observe that the distribution has not yet converged — it is still concentrated near start. This illustrates the mixing time of the chain.

Reference Solution

use rand::{Rng, SeedableRng, rngs::SmallRng};
use std::collections::HashMap;

struct RandomWalk {
    min: i32,
    max: i32,
    /// prob_right[i] = probability of stepping right from position `min + i as i32`
    prob_right: Vec<f64>,
}

impl RandomWalk {
    fn new(min: i32, max: i32, p: f64) -> Self {
        let n = (max - min + 1) as usize;
        Self { min, max, prob_right: vec![p; n] }
    }

    fn idx(&self, pos: i32) -> usize {
        (pos - self.min) as usize
    }

    fn step(&self, pos: i32, rng: &mut impl Rng) -> i32 {
        let r: f64 = rng.gen();
        let p = self.prob_right[self.idx(pos)];
        let moved = if r < p { pos + 1 } else { pos - 1 };
        moved.clamp(self.min, self.max)
    }

    fn simulate(&self, start: i32, steps: usize, rng: &mut impl Rng) -> Vec<i32> {
        let mut positions = Vec::with_capacity(steps + 1);
        let mut current = start;
        positions.push(current);
        for _ in 0..steps {
            current = self.step(current, rng);
            positions.push(current);
        }
        positions
    }

    fn histogram(
        &self,
        start: i32,
        steps: usize,
        trials: usize,
        rng: &mut impl Rng,
    ) -> HashMap<i32, usize> {
        let mut hist = HashMap::new();
        for _ in 0..trials {
            let positions = self.simulate(start, steps, rng);
            let final_pos = *positions.last().unwrap();
            *hist.entry(final_pos).or_insert(0) += 1;
        }
        hist
    }
}

fn print_histogram(hist: &HashMap<i32, usize>, min: i32, max: i32, trials: usize) {
    let max_count = hist.values().copied().max().unwrap_or(1);
    let bar_width = 40;
    for pos in min..=max {
        let count = hist.get(&pos).copied().unwrap_or(0);
        let bar_len = count * bar_width / max_count;
        let bar: String = "#".repeat(bar_len);
        println!("{:4}: {:40} ({:.1}%)", pos, bar, 100.0 * count as f64 / trials as f64);
    }
}

fn main() {
    let mut rng = SmallRng::seed_from_u64(42);

    println!("=== Symmetric walk (p = 0.5) ===");
    let walk = RandomWalk::new(-5, 5, 0.5);
    let hist = walk.histogram(0, 200, 10_000, &mut rng);
    print_histogram(&hist, -5, 5, 10_000);

    println!();

    println!("=== Asymmetric walk (p = 0.7) ===");
    let walk = RandomWalk::new(-5, 5, 0.7);
    let hist = walk.histogram(0, 200, 10_000, &mut rng);
    print_histogram(&hist, -5, 5, 10_000);
}

Expected output structure (exact percentages vary by seed):

=== Symmetric walk (p = 0.5) ===
  -5: ########################################  (9.3%)
  -4: #####################################     (8.6%)
  -3: ########################################  (9.4%)
  -2: #####################################     (8.7%)
  -1: #####################################     (8.7%)
   0: ####################################      (8.5%)
   1: #####################################     (8.8%)
   2: #####################################     (8.7%)
   3: #####################################     (8.9%)
   4: ####################################      (8.4%)
   5: #########################################  (9.6%)

=== Asymmetric walk (p = 0.7) ===
  -5:                                           (0.0%)
  -4:                                           (0.0%)
  -3:                                           (0.1%)
  -2:                                           (0.2%)
  -1:                                           (0.4%)
   0:                                           (0.8%)
   1: #                                         (1.9%)
   2: ###                                       (4.4%)
   3: #######                                   (10.5%)
   4: #################                         (24.4%)
   5: ########################################  (57.3%)

The symmetric walk converges to a nearly flat distribution — each position visited roughly 1/11 ≈ 9.1% of the time. The asymmetric walk piles up hard against the right boundary: with p = 0.7 the stationary probabilities grow by a factor of 0.7/0.3 ≈ 2.33 per position, so +4 and +5 together capture the bulk of the probability mass.


Part 3 — Text Generation

6. Text Generation with Markov Chains

Text can be modeled as a Markov chain where each state is a word (or sequence of words) and transitions represent which words tend to follow. This section explains the bigram model, why it produces surprisingly coherent short sequences, and what its limitations reveal about the relationship between statistical models and language. No code is written here — it prepares you for Exercises 3 and 4.

Words as states. To model text as a Markov chain, treat each word as a state and the act of writing the next word as a transition. Given the word “cat,” what word comes next? A Markov model answers that question by consulting statistics drawn from a training corpus: it scans every occurrence of “cat” and records which words followed it, then samples from that empirical distribution. The result is a sequence of words generated one step at a time, each word chosen probabilistically based only on the current word — not on the full sentence or paragraph that came before. This is the Markov property applied to language.

Bigrams and transition tables. A bigram is an ordered pair of adjacent words. To build a bigram model, scan the corpus left to right and record every consecutive word pair. Count how many times each pair appears, then for each word w express those counts as a probability distribution over successor words. This distribution forms one row of the transition table: a lookup from every word to a weighted list of what can follow it. The table is learned entirely from data — no grammar rules, no meaning, just co-occurrence statistics.

Worked example. Consider this two-sentence corpus:

“the cat sat on the mat. the cat sat on the hat.”

Scanning the corpus (treating each sentence as a word sequence and ignoring the periods) yields the following bigrams and their counts. Dividing each count by the row total gives the transition probability:

Current word   Next word   Count   Probability
the            cat           2        0.50
the            mat           1        0.25
the            hat           1        0.25
cat            sat           2        1.00
sat            on            2        1.00
on             the           2        1.00
Notice that “cat,” “sat,” and “on” each have only one possible successor — the small corpus left them no choice. Only “the” branches: half the time it is followed by “cat,” a quarter of the time by “mat,” and a quarter by “hat.” Starting from “the,” one plausible generated sequence is: the → cat → sat → on → the → hat — and then “hat” would be an end-of-sentence state.

Why the text sounds strange. Short stretches of output are surprisingly readable: every adjacent pair the model produces appeared in the training data, so each local transition is both grammatically and semantically plausible. The problem surfaces over longer spans. Because the model has no memory beyond the current word, it cannot maintain a topic, complete a thought, or avoid contradictions introduced a few sentences back. The result resembles text written by someone who half-knows the language: each individual step looks right, but the destination keeps shifting without purpose.

Order-n chains. A bigram model is an order-1 Markov chain — the next state depends on exactly one previous word. An order-2 model (a trigram model) conditions on the last two words; order n uses the last n words as context. Increasing n brings sharply improved local coherence — at n = 3 or 4 the output begins to reproduce full phrases from the corpus verbatim — but at the cost of novelty. A high-order model has likely seen most of its context only once, so it often has no real choice but to copy the source text rather than recombine it. The right tradeoff depends on corpus size: larger corpora can support higher n without the model simply memorising its training data.


7. Exercise 3 — Bigram Text Generator

Goal: Build a Markov chain over words from an input text. Each state is a single word; transitions are learned from the corpus. Generate novel word sequences of a given length from a chosen seed word.

Setup

You can extend the random-walk project from Exercise 2, or create a fresh binary:

cargo new bigram-text
cd bigram-text
cargo add rand

You will also need use std::collections::HashMap; at the top of src/main.rs.

Starter Code

Replace the contents of src/main.rs with the following skeleton. Do not change the struct layout or function signatures — your task is to fill in the todo!() bodies and write main.

use rand::Rng;
use std::collections::HashMap;

struct BigramModel {
    /// transitions[word] = list of (next_word, weight) pairs
    transitions: HashMap<String, Vec<(String, usize)>>,
}

impl BigramModel {
    fn train(corpus: &str) -> Self { todo!() }
    fn generate(&self, seed: &str, length: usize, rng: &mut impl Rng) -> Vec<String> { todo!() }
}

fn main() { todo!() }

Step 1 — Tokenize the corpus

Inside train, split the corpus into lowercase words:

let words: Vec<String> = corpus
    .split_whitespace()
    .map(|s| s.to_lowercase())
    .collect();

split_whitespace handles any run of spaces, newlines, or tabs. Lowercasing ensures “The” and “the” are treated as the same state. For this exercise we leave punctuation attached to words (so “sat.” and “sat” are distinct) to keep the code simple — stripping it is a good stretch goal.

Step 2 — Build the transition table

Iterate every consecutive word pair using windows(2). The first word is the current state; the second is the successor to record:

for window in words.windows(2) {
    let current = window[0].clone();
    let next    = window[1].clone();
    let entry = transitions.entry(current).or_default();
    if let Some(pair) = entry.iter_mut().find(|(w, _)| w == &next) {
        pair.1 += 1;
    } else {
        entry.push((next, 1));
    }
}

For each word, build a list of (successor_word, count) pairs. If the successor already appears in the list, increment its count; otherwise push a new entry with count 1. After scanning the whole corpus, every word maps to a weighted list of what can follow it — exactly one row of the bigram transition table.

Return BigramModel { transitions }.

Step 3 — Implement weighted random sampling

generate must sample the next word proportionally to its observed count in the successor list. Add a helper function outside impl BigramModel that performs integer weighted sampling — numerically exact, with none of the rounding error of floating-point accumulation:

fn sample_weighted<'a>(choices: &'a [(String, usize)], rng: &mut impl Rng) -> &'a str {
    let total: usize = choices.iter().map(|(_, w)| w).sum();
    let mut r = rng.gen_range(0..total);
    for (word, weight) in choices {
        if r < *weight {
            return word;
        }
        r -= weight;
    }
    &choices.last().unwrap().0   // fallback: satisfies the compiler
}

The r -= weight trick avoids floating-point comparisons: draw a random index in [0, total), then walk the list subtracting each bucket’s weight until the remaining value lands inside the current entry. This is the discrete-distribution equivalent of the cumulative-probability walk used in Exercise 1, rewritten in integer arithmetic.

Step 4 — Implement BigramModel::generate

generate starts from a seed word and appends one word at a time:

  1. Push the seed onto output and set current = seed.to_string().
  2. Loop up to length times: look up current in self.transitions and call sample_weighted on the result.
  3. Push the sampled word onto output and advance current to it.
  4. If current is not in the transition table (a dead end — the word never appeared mid-sentence in the corpus), stop early with break.
let mut output = vec![seed.to_string()];
let mut current = seed.to_string();

for _ in 0..length {
    match self.transitions.get(&current) {
        None => break,
        Some(choices) => {
            let next = sample_weighted(choices, rng).to_string();
            output.push(next.clone());
            current = next;
        }
    }
}
output

Step 5 — Run the generator

In main, train on a short corpus and generate several sequences from different seed words. Use a seeded RNG for reproducibility (add use rand::{SeedableRng, rngs::SmallRng};):

const CORPUS: &str =
    "alice was beginning to get very tired of sitting by her sister on the \
     bank and of having nothing to do once or twice she had peeped into the \
     book her sister was reading but it had no pictures or conversations in \
     it and what is the use of a book thought alice without pictures or \
     conversations alice was beginning to get very tired of sitting";

fn main() {
    let mut rng = SmallRng::seed_from_u64(42);
    let model = BigramModel::train(CORPUS);

    for seed in &["alice", "the", "book"] {
        let words = model.generate(seed, 15, &mut rng);
        println!("{}", words.join(" "));
    }
}

Expected output structure (exact words vary by seed):

alice was beginning to get very tired of sitting by her sister was beginning
the use of a book her sister on the bank and of sitting by her sister
book thought alice without pictures or conversations alice was beginning to get

Observe that every adjacent pair in the output appeared in the training corpus — each local transition is faithful to the source text. But over longer spans the topic shifts erratically, because the model has no memory beyond the immediately preceding word.

Step 6 — Try a larger corpus

The small Alice snippet produces repetitive output because many words have only one possible successor in a short text. To see genuine branching behaviour, download the first chapter of Alice’s Adventures in Wonderland from Project Gutenberg (plain text, freely available) and feed the full text to BigramModel::train. With more data:

  • Common words like “the” and “and” branch to dozens of successors.
  • Generated sequences stay on-topic for longer stretches before the topic shifts.
  • The difference between the single-word context used here and the two-word context in Exercise 4 becomes immediately obvious.

You can embed the text file directly in the binary with Rust’s include_str! macro:

const CORPUS: &str = include_str!("../corpus/alice_ch1.txt");

Reference Solution

use rand::{Rng, SeedableRng, rngs::SmallRng};
use std::collections::HashMap;

struct BigramModel {
    /// transitions[word] = list of (next_word, weight) pairs
    transitions: HashMap<String, Vec<(String, usize)>>,
}

impl BigramModel {
    fn train(corpus: &str) -> Self {
        let words: Vec<String> = corpus
            .split_whitespace()
            .map(|s| s.to_lowercase())
            .collect();

        let mut transitions: HashMap<String, Vec<(String, usize)>> = HashMap::new();

        for window in words.windows(2) {
            let current = window[0].clone();
            let next    = window[1].clone();
            let entry = transitions.entry(current).or_default();
            if let Some(pair) = entry.iter_mut().find(|(w, _)| w == &next) {
                pair.1 += 1;
            } else {
                entry.push((next, 1));
            }
        }

        BigramModel { transitions }
    }

    fn generate(&self, seed: &str, length: usize, rng: &mut impl Rng) -> Vec<String> {
        let mut output = vec![seed.to_string()];
        let mut current = seed.to_string();

        for _ in 0..length {
            match self.transitions.get(&current) {
                None => break,
                Some(choices) => {
                    let next = sample_weighted(choices, rng).to_string();
                    output.push(next.clone());
                    current = next;
                }
            }
        }
        output
    }
}

fn sample_weighted<'a>(choices: &'a [(String, usize)], rng: &mut impl Rng) -> &'a str {
    let total: usize = choices.iter().map(|(_, w)| w).sum();
    let mut r = rng.gen_range(0..total);
    for (word, weight) in choices {
        if r < *weight {
            return word;
        }
        r -= weight;
    }
    &choices.last().unwrap().0
}

const CORPUS: &str =
    "alice was beginning to get very tired of sitting by her sister on the \
     bank and of having nothing to do once or twice she had peeped into the \
     book her sister was reading but it had no pictures or conversations in \
     it and what is the use of a book thought alice without pictures or \
     conversations alice was beginning to get very tired of sitting";

fn main() {
    let mut rng = SmallRng::seed_from_u64(42);
    let model = BigramModel::train(CORPUS);

    for seed in &["alice", "the", "book"] {
        let words = model.generate(seed, 15, &mut rng);
        println!("{}", words.join(" "));
    }
}

8. Exercise 4 — N-gram Generalization

Goal: Generalize the bigram model to an n-gram model where each state is a window of n consecutive words. Compare the output quality for n = 1, 2, 3, and 4 on the same corpus.

Setup

Extend the project from Exercise 3, or create a fresh one:

cargo new ngram-chain
cd ngram-chain
cargo add rand

The NgramModel stores transitions keyed by Vec<String> — a window of n consecutive words. Vec<String> implements Hash and Eq, so it works directly as a HashMap key with no extra wrapping.

Starter Code

Replace (or extend) src/main.rs with the following skeleton:

use rand::Rng;
use std::collections::HashMap;

struct NgramModel {
    n: usize,
    transitions: HashMap<Vec<String>, Vec<(String, usize)>>,
}

impl NgramModel {
    fn train(corpus: &str, n: usize) -> Self { todo!() }
    fn generate(&self, seed: Vec<String>, length: usize, rng: &mut impl Rng) -> Vec<String> { todo!() }
}

fn main() { todo!() }

Step 1 — Tokenize and build the transition table

Inside train, convert the corpus into a flat list of lowercase words:

let words: Vec<String> = corpus
    .split_whitespace()
    .map(|s| s.to_lowercase())
    .collect();

Then iterate over sliding windows of size n + 1. The first n elements form the key (the context); the last element is the successor word to record:

for window in words.windows(n + 1) {
    let key: Vec<String> = window[..n].to_vec();
    let next: String = window[n].clone();
    // insert (key → next) into the transition table
}

Use HashMap::entry(...).or_default() to get-or-insert the successor list. Scan for an existing entry for next and increment its count, or push (next, 1) if it is new.

Step 2 — Weighted sampling helper

Reuse the same cumulative-weight technique from Exercise 3. The helper below accepts any slice of (word, count) pairs and returns a randomly chosen word with probability proportional to its count:

fn sample<'a>(choices: &'a [(String, usize)], rng: &mut impl Rng) -> &'a str {
    let total: usize = choices.iter().map(|(_, w)| w).sum();
    let mut r = rng.gen_range(0..total);
    for (word, weight) in choices {
        if r < *weight {
            return word;
        }
        r -= weight;
    }
    &choices.last().unwrap().0
}

Step 3 — Implement NgramModel::generate

generate is called with a seed — a Vec<String> of exactly n words. It extends that seed word by word, up to length additional words, using a sliding window to track the current context:

  1. Copy the seed into output and into a VecDeque<String> called window.
  2. Each step: collect window into a Vec<String> to use as the lookup key.
  3. If the key is missing from self.transitions, stop early — the chain has reached a dead end.
  4. Sample the next word, push it onto output, pop the oldest word from window, push the new word.
use std::collections::VecDeque;

// inside generate:
let mut window: VecDeque<String> = VecDeque::from(seed.clone());
let mut output = seed;

for _ in 0..length {
    let key: Vec<String> = window.iter().cloned().collect();
    match self.transitions.get(&key) {
        None => break,
        Some(choices) => {
            let next = sample(choices, rng).to_string();
            output.push(next.clone());
            window.pop_front();
            window.push_back(next);
        }
    }
}
output

Step 4 — Compare n = 1, 2, 3, 4

In main, train one model per value of n and generate 50 additional words from each. Use a fixed RNG seed for reproducibility. The short Alice corpus below is enough to observe the trend; swap in a larger public-domain text (e.g., the first chapter of Alice’s Adventures in Wonderland from Project Gutenberg) for more interesting output.

const CORPUS: &str =
    "alice was beginning to get very tired of sitting by her sister on the \
     bank and of having nothing to do once or twice she had peeped into the \
     book her sister was reading but it had no pictures or conversations in \
     it and what is the use of a book thought alice without pictures or \
     conversations alice was beginning to get very tired of sitting";

fn main() {
    use rand::{SeedableRng, rngs::SmallRng};

    for n in 1..=4 {
        let model = NgramModel::train(CORPUS, n);
        let mut rng = SmallRng::seed_from_u64(42);

        // seed = the first n words of the corpus
        let seed: Vec<String> = CORPUS
            .split_whitespace()
            .take(n)
            .map(|s| s.to_lowercase())
            .collect();

        let words = model.generate(seed, 50, &mut rng);
        println!("n={}: {}", n, words.join(" "));
        println!();
    }
}

Expected observations:

  • n = 1: one word of context — exactly the bigram model from Exercise 3. Every adjacent pair appeared in the corpus, so individual transitions feel plausible, but the topic shifts erratically over longer runs.
  • n = 2: two words of context (a trigram model). Longer coherent stretches emerge; you will start to recognise verbatim phrases from the corpus.
  • n = 3: on a small corpus, most three-word contexts appear only once, so the output follows the source closely with only occasional branches.
  • n = 4: nearly every four-word context appears exactly once, leaving the model no real choice but to reproduce the training text almost verbatim. Try a larger corpus to see n = 3 and 4 produce novel output.

Step 5 — Memorisation vs novelty

The pattern above is the central tension in all n-gram language models:

  • Small n: short context → many plausible continuations → high novelty, low coherence.
  • Large n: long context → typically a unique continuation → low novelty, high local fidelity to the corpus.

The corpus size determines the crossover point. For a paragraph-sized text, n = 2 is usually the maximum useful order. For a novel-length corpus, n = 4 or 5 can produce readable, novel output without simply transcribing the source.

Stretch Goal — Character-level n-grams

Swap words for individual characters: tokenize with .chars() instead of .split_whitespace(). The model and sampling logic are unchanged — only the “token” definition shifts from words to characters:

// Character-level tokenization for train:
let chars: Vec<String> = corpus.chars().map(|c| c.to_string()).collect();
// Replace `words` with `chars` and proceed identically.

Generate 200 characters at n = 3, 5, and 8. You will see the model progress from random letter sequences, to plausible letter clusters and syllables, to recognisable words and phrases as n grows. This also demonstrates that the n-gram approach is domain-agnostic: the same code works for words, characters, DNA bases, MIDI note sequences, or any discrete token stream.

Reference Solution

use rand::{Rng, SeedableRng, rngs::SmallRng};
use std::collections::{HashMap, VecDeque};

struct NgramModel {
    n: usize,
    transitions: HashMap<Vec<String>, Vec<(String, usize)>>,
}

impl NgramModel {
    fn train(corpus: &str, n: usize) -> Self {
        let words: Vec<String> = corpus
            .split_whitespace()
            .map(|s| s.to_lowercase())
            .collect();

        let mut transitions: HashMap<Vec<String>, Vec<(String, usize)>> = HashMap::new();

        for window in words.windows(n + 1) {
            let key: Vec<String> = window[..n].to_vec();
            let next = window[n].clone();
            let entry = transitions.entry(key).or_default();
            if let Some(pair) = entry.iter_mut().find(|(w, _)| w == &next) {
                pair.1 += 1;
            } else {
                entry.push((next, 1));
            }
        }

        NgramModel { n, transitions }
    }

    fn generate(&self, seed: Vec<String>, length: usize, rng: &mut impl Rng) -> Vec<String> {
        assert_eq!(seed.len(), self.n, "seed must have exactly n words");
        let mut window: VecDeque<String> = VecDeque::from(seed.clone());
        let mut output = seed;

        for _ in 0..length {
            let key: Vec<String> = window.iter().cloned().collect();
            match self.transitions.get(&key) {
                None => break,
                Some(choices) => {
                    let next = sample(choices, rng).to_string();
                    output.push(next.clone());
                    window.pop_front();
                    window.push_back(next);
                }
            }
        }
        output
    }
}

fn sample<'a>(choices: &'a [(String, usize)], rng: &mut impl Rng) -> &'a str {
    let total: usize = choices.iter().map(|(_, w)| w).sum();
    let mut r = rng.gen_range(0..total);
    for (word, weight) in choices {
        if r < *weight {
            return word;
        }
        r -= weight;
    }
    &choices.last().unwrap().0
}

const CORPUS: &str =
    "alice was beginning to get very tired of sitting by her sister on the \
     bank and of having nothing to do once or twice she had peeped into the \
     book her sister was reading but it had no pictures or conversations in \
     it and what is the use of a book thought alice without pictures or \
     conversations alice was beginning to get very tired of sitting";

fn main() {
    for n in 1..=4 {
        let model = NgramModel::train(CORPUS, n);
        let mut rng = SmallRng::seed_from_u64(42);

        let seed: Vec<String> = CORPUS
            .split_whitespace()
            .take(n)
            .map(|s| s.to_lowercase())
            .collect();

        let words = model.generate(seed, 50, &mut rng);
        println!("n={}: {}", n, words.join(" "));
        println!();
    }
}

Part 4 — Deeper Concepts

9. Stationary Distributions

A stationary distribution π is a probability distribution over states that is unchanged by one step of the chain: π P = π. This section covers how to find stationary distributions analytically (for small chains) and via power iteration, and explains when they exist and are unique — introducing the concepts of irreducibility and aperiodicity.

The stationary equation. Write the state distribution as a row vector π = [π₀, π₁, …, π_{n−1}]. Then π is stationary if it satisfies two simultaneous conditions:

πP = π           (fixed-point equation)
Σ πᵢ = 1         (normalisation)
πᵢ ≥ 0  for all i

The fixed-point equation says that right-multiplying π by the transition matrix returns exactly π: one step of the chain leaves the distribution unchanged. In terms of individual entries, the condition expands to:

πⱼ = Σᵢ πᵢ · P[i][j]   for every j

The probability mass flowing into state j from every state i — weighted by how likely the chain is to be in each i — exactly equals the current probability already in state j. No state is gaining or losing probability over time; the distribution is in balance.

Long-run frequency interpretation. The simulations in Exercise 1 showed this balance in practice: after thousands of steps, the fraction of time the chain occupied each state stabilised near 2/3 Sunny and 1/3 Rainy, regardless of the starting state. This is the ergodic theorem for Markov chains: under appropriate conditions, the time-average frequency of visiting state i converges to πᵢ almost surely. The stationary distribution is not merely a mathematical fixed point — it is the long-run proportion of time the chain spends in each state, and it is precisely what your simulations were converging toward.

Irreducibility. A chain is irreducible if every state is reachable from every other state in some finite number of steps. Formally, for every pair (i, j) there exists k ≥ 1 with P^k[i][j] > 0. An irreducible chain cannot get permanently trapped in a proper subset of states; all states form a single communicating class. A chain with an absorbing state (one where P[i][i] = 1 and the chain can enter but never leave) is not irreducible. Irreducibility is what rules out a chain splitting into separate “islands” each with its own local stationary distribution.

Aperiodicity. A state has period d if returns to that state can only happen in a number of steps that is a multiple of d. Formally, d = gcd{k ≥ 1 : P^k[i][i] > 0}. A state is aperiodic if d = 1, and a chain is aperiodic if every state is. An aperiodic chain does not oscillate on a fixed cycle — there is no rhythm forcing the chain to return to state i only every 2 or every 5 steps. A self-loop (P[i][i] > 0) immediately makes state i aperiodic, because the chain can return after 1 step, so the gcd collapses to 1. A finite, irreducible, aperiodic chain is called ergodic, and such a chain is guaranteed to have exactly one stationary distribution. Furthermore, starting from any initial distribution π₀, the sequence π₀ P^k converges to π as k → ∞. The weather chain is both irreducible (Sunny → Rainy and Rainy → Sunny are both reachable in one step) and aperiodic (both states have positive self-loop probabilities: P[0][0] = 0.8 and P[1][1] = 0.6).

Analytical solution for the weather chain. To solve π P = π for the 2-state chain, write out the equation component-by-component. With π = [π₀, π₁] and the weather transition matrix:

πP = π   expands to:

  π₀ · P[0][0] + π₁ · P[1][0] = π₀     →     0.8·π₀ + 0.4·π₁ = π₀
  π₀ · P[0][1] + π₁ · P[1][1] = π₁     →     0.2·π₀ + 0.6·π₁ = π₁

Both equations carry the same information (one follows from the other because the rows of P sum to 1 and the probabilities sum to 1), so use only the first and add the normalisation constraint:

0.8·π₀ + 0.4·π₁ = π₀
            0.4·π₁ = 0.2·π₀
                π₀ = 2·π₁

Normalisation:  π₀ + π₁ = 1
             2·π₁ + π₁ = 1
                    π₁ = 1/3 ≈ 0.333
                    π₀ = 2/3 ≈ 0.667

The unique stationary distribution is π ≈ [0.667, 0.333]: about two-thirds of all days are Sunny and one-third are Rainy in the long run. This matches the 66.7 % / 33.3 % figures from the Exercise 1 simulation.

Power iteration. For chains with many states, setting up and solving the linear system π P = π directly is impractical. Power iteration is a standard numerical alternative: start from any distribution and repeatedly apply P until successive distributions are indistinguishable. Convergence to the unique π is guaranteed for ergodic chains.

// Pseudocode: power iteration for stationary distribution
let π = uniform distribution over all n states
loop:
    let π_next = π · P          // one step of the chain
    if max |π_next[i] - π[i]| < tolerance:
        break
    π = π_next
// π now approximates the stationary distribution

A concrete Rust sketch for the 2-state weather chain:

fn power_iteration(p: [[f64; 2]; 2], tolerance: f64) -> [f64; 2] {
    let mut pi = [0.5_f64, 0.5];
    loop {
        let pi_next = [
            pi[0] * p[0][0] + pi[1] * p[1][0],
            pi[0] * p[0][1] + pi[1] * p[1][1],
        ];
        let diff = (pi_next[0] - pi[0]).abs().max((pi_next[1] - pi[1]).abs());
        pi = pi_next;
        if diff < tolerance {
            break;
        }
    }
    pi
}

fn main() {
    let p = [[0.8, 0.2], [0.4, 0.6]];
    let pi = power_iteration(p, 1e-9);
    println!("π ≈ [{:.6}, {:.6}]", pi[0], pi[1]);
    // Output: π ≈ [0.666667, 0.333333]
}

Running this for the weather chain converges in a few dozen iterations — the error shrinks by a factor of |λ₂| = 0.4 each step — and agrees with the analytical result. For an n-state chain, each iteration costs O(n²), and the number of iterations needed scales inversely with the spectral gap — the difference between 1 and the magnitude of the second-largest eigenvalue of P. A large spectral gap means fast convergence; a gap close to zero means the chain mixes slowly and power iteration requires many steps.


10. Applications and Further Reading

Markov chains appear throughout computer science and mathematics: PageRank, MCMC sampling, hidden Markov models, reinforcement learning, and more. This section surveys these applications at a high level and points to books, papers, and courses for learners who want to go deeper on any thread.

Application Survey

PageRank. Google’s original ranking algorithm modeled the web as a Markov chain: each page is a state, and each hyperlink out of a page is a transition taken with probability 1/L, where L is the number of outgoing links on the source page. A small “teleportation” probability was added so a random surfer occasionally jumps to a uniformly random page rather than following a link, ensuring the chain is irreducible and has a unique stationary distribution. The stationary probability of each page — the fraction of time a random surfer spends there in the long run — becomes its rank. Because the web had billions of pages, computing the stationary distribution via power iteration rather than direct matrix inversion was essential; the same convergence guarantee from Section 9 applies at planetary scale.

Markov Chain Monte Carlo (MCMC). Bayesian inference often requires integrating over high-dimensional parameter spaces where the posterior distribution has no closed form. MCMC methods solve this by constructing a Markov chain whose stationary distribution is the target posterior, then running the chain long enough that its samples approximate draws from that distribution. The Metropolis-Hastings algorithm is the foundational recipe: propose a move to a new state, accept it with a probability that preserves detailed balance, and reject it otherwise. Variants such as Gibbs sampling, Hamiltonian Monte Carlo, and the No-U-Turn Sampler (NUTS) power nearly every modern probabilistic programming framework, from Stan to PyMC.

Hidden Markov Models (HMMs). An HMM separates a Markov chain into two layers: a hidden state sequence that evolves according to a transition matrix, and an observation sequence where each hidden state emits an observable symbol with some probability. The key insight is that the true states are never directly seen — only the observations are. HMMs were the dominant approach in speech recognition for decades (phonemes as hidden states, acoustic features as observations) and remain central in bioinformatics for gene prediction and sequence segmentation. The Viterbi algorithm finds the most likely hidden state path for a given observation sequence in time linear in the sequence length; the Baum-Welch algorithm trains HMMs from unlabelled data using expectation-maximisation.

Reinforcement Learning. Most reinforcement learning problems are formulated as Markov Decision Processes (MDPs), which augment a Markov chain with a set of actions and a scalar reward signal. At each step the agent chooses an action, the environment transitions to a new state according to a transition distribution that depends on both the current state and the chosen action, and the agent receives a reward. The goal is to find a policy — a rule mapping states to actions — that maximises cumulative discounted reward. Because the next state depends only on the current state and action (not the history), the Markov property is what makes value-function algorithms like Q-learning and policy-gradient methods tractable. Sutton and Barto’s textbook (listed below) treats this connection in full rigour.

Bioinformatics — Sequence Analysis. DNA and protein sequences are naturally modelled as Markov chains over an alphabet of bases or amino acids. A simple k-th order Markov model assigns probabilities to short subsequences and can distinguish coding regions from non-coding regions: CpG islands in mammalian genomes, for example, have transition probabilities measurably different from the genomic background. Profile HMMs generalise this to align whole families of sequences — each column in a multiple sequence alignment becomes a hidden state with its own emission distribution, allowing robust database search even for distantly related proteins.

Queueing Theory. The classic M/M/1 queue — arrivals according to a Poisson process, exponential service times, a single server — is a continuous-time Markov chain on the non-negative integers, where the state is the number of customers in the system. Its stationary distribution is geometric, giving simple closed-form expressions for average queue length and waiting time. More complex queueing networks (multiple servers, priorities, finite buffers) extend the same framework and are used to size data-centre infrastructure, analyse hospital emergency departments, and design network switches. Continuous-time Markov chains replace the transition matrix with a generator matrix Q whose off-diagonal entries are transition rates rather than probabilities.

Language Models. The n-gram models from Exercises 3 and 4 are finite-order Markov chains over word tokens, and they directly preceded modern neural language models. In the 1990s and 2000s, trigram and 4-gram models with smoothing (Kneser-Ney, Witten-Bell) were the state of the art for machine translation and speech recognition. Neural language models replaced explicit Markov structure with learned representations, but the conceptual scaffolding is the same: predict the next token from a bounded window of context. Understanding the Markov chain view of language — its local coherence, its lack of long-range memory, its reliance on corpus statistics — clarifies both what earlier systems got right and what neural approaches had to learn to transcend.


Further Reading

Books

  • Blitzstein & Hwang — Introduction to Probability (2nd ed.) A beautifully written undergraduate probability text with a full chapter on Markov chains. The authors’ lecture videos and course materials are freely available; a free PDF of the book is offered on the book’s companion site. Start here if your probability background is thin or if you want every concept illustrated with concrete examples before the formalism arrives.

  • Norris — Markov Chains The standard rigorous treatment at the advanced-undergraduate / early-graduate level. Covers discrete- and continuous-time chains, convergence, reversibility, and applications with full proofs. Dense but thorough; worth working through if you intend to read research papers that use Markov chains as a theoretical tool.

  • Sutton & Barto — Reinforcement Learning: An Introduction (2nd ed.) The canonical RL textbook, freely available as a PDF from the authors. Chapters 3 and 4 formalise MDPs and dynamic programming using exactly the Markov chain machinery developed in this course. Reading those chapters after completing this course is a natural next step toward understanding how modern game-playing and robotics agents are designed.

Articles

  • Metropolis-Hastings algorithm — Wikipedia. A well-maintained article that covers the algorithm statement, acceptance-ratio derivation, intuition for why detailed balance implies the correct stationary distribution, and pseudocode. A good companion to the original 1953 Metropolis et al. paper (five pages, freely available) and the 1970 Hastings generalisation.

Rust Ecosystem Pointers

The exercises in this course used rand for sampling. A few other crates are useful as you build more serious probabilistic programs in Rust:

  • nalgebra — a comprehensive linear-algebra library covering vectors, dense matrices, and decompositions. Use it to compute P^k via repeated matrix multiplication or to solve the stationary-distribution equations π P = π, Σ πᵢ = 1 as a linear system.

  • petgraph — graph data structures and algorithms. Markov chains are directed weighted graphs, and petgraph lets you represent them as such, run graph-theoretic algorithms (strongly connected components give you irreducible sub-chains), and visualise the structure via its Graphviz export.

  • statrs — a statistics library providing common probability distributions with their PDFs, CDFs, and samplers. Useful when building emission distributions for HMMs or when you need chi-squared tests to check whether simulated chain frequencies match theoretical stationary probabilities.
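As a concrete, crate-free sketch of the P^k computation mentioned under nalgebra (plain nested Vecs stand in for a matrix type; with nalgebra you would use DMatrix and its multiplication operator instead):

```rust
// Raise a transition matrix P to the k-th power by repeated multiplication.
// Rows of P must sum to 1; rows of P^k converge to the stationary distribution.
fn mat_mul(a: &[Vec<f64>], b: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let n = a.len();
    let mut c = vec![vec![0.0; n]; n];
    for i in 0..n {
        for j in 0..n {
            for k in 0..n {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
    c
}

fn mat_pow(p: &[Vec<f64>], k: u32) -> Vec<Vec<f64>> {
    let n = p.len();
    // Start from the identity matrix.
    let mut result: Vec<Vec<f64>> = (0..n)
        .map(|i| (0..n).map(|j| if i == j { 1.0 } else { 0.0 }).collect())
        .collect();
    for _ in 0..k {
        result = mat_mul(&result, p);
    }
    result
}

fn main() {
    // Two-state chain; stationary distribution is (5/6, 1/6).
    let p = vec![vec![0.9, 0.1], vec![0.5, 0.5]];
    let p10 = mat_pow(&p, 10);
    println!("{:.3} {:.3}", p10[0][0], p10[0][1]); // already close to 0.833 0.167
}
```

For a two-state chain the rows of P^10 are already close to the stationary distribution, which makes this a handy sanity check for a stationary-distribution solver.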

Vector Database Self-Guided Course

This document is a self-guided course on vector databases. It is organized into four parts: conceptual foundations, the internals of vector search systems, hands-on Rust exercises with Turso and sqlite-vec, and real-world application pipelines. Each section is either a reading lesson or a hands-on Rust programming exercise. Sections marked 🚧 are stubs whose full content is tracked in an nbd ticket — follow the ticket ID to find the detailed learning objectives and instructions.


Table of Contents

Part 1 — Foundations

  1. What Is a Vector?
  2. Embeddings
  3. Vector Similarity

Part 2 — Vector Databases

  1. What Is a Vector Database?
  2. Under the Hood: ANN Algorithms

Part 3 — Turso + sqlite-vec Basics

  1. Setting Up
  2. Exercise 1 — Storing and Retrieving Vectors
  3. Exercise 2 — K-Nearest Neighbor Search

Part 4 — Real Applications

  1. Generating Embeddings in Rust
  2. Exercise 3 — Semantic Document Search
  3. Exercise 4 — Recommendation Engine
  4. Exercise 5 — Retrieval-Augmented Generation

Part 1 — Foundations

1. What Is a Vector?

A vector is an ordered list of numbers. That is the entire definition — nothing more exotic than a list where position matters. A two-element list [3.0, 4.0] is a vector; so is a 1,536-element list of floating-point values produced by a language model. What makes vectors useful is that the numbers have a geometric interpretation: each element is a coordinate along one axis of a space, and the vector as a whole names a point (or an arrow from the origin to that point) in that space.

Geometric intuition in two and three dimensions. Start with the familiar. A 2-dimensional vector [x, y] is a point in the plane — the kind you plot on graph paper. The vector [3.0, 4.0] sits three units to the right of the origin and four units up. An arrow drawn from [0, 0] to [3, 4] has a magnitude (length) of √(3² + 4²) = 5 and points in a specific direction. Magnitude and direction together completely characterise the vector; change either one and you have a different vector.

A 3-dimensional vector [x, y, z] extends this to physical space: three coordinates, three axes, one point. You can still compute a magnitude — √(x² + y² + z²) — and you can still talk about direction. Two 3D vectors point in the same direction if one is a positive scalar multiple of the other; they are perpendicular (orthogonal) if their dot product is zero.

High-dimensional spaces. Nothing in the definition of a vector limits it to two or three elements. A d-dimensional vector [x₁, x₂, …, x_d] is a point in d-dimensional space. The geometry extends perfectly: magnitude is √(x₁² + x₂² + … + x_d²), the dot product of two vectors is Σᵢ aᵢ · bᵢ, and you can compute angles and distances between points just as you would in 2D or 3D.
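The magnitude and dot-product formulas above translate directly into a few lines of Rust; this sketch works for any dimension d:

```rust
// Magnitude: sqrt of the sum of squared elements.
fn magnitude(v: &[f64]) -> f64 {
    v.iter().map(|x| x * x).sum::<f64>().sqrt()
}

// Dot product: sum of element-wise products.
fn dot(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn main() {
    println!("{}", magnitude(&[3.0, 4.0]));           // 5
    println!("{}", dot(&[1.0, 0.0], &[0.0, 1.0]));    // 0: orthogonal
}
```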

High-dimensional geometry is counterintuitive in subtle ways that are worth knowing:

  • The curse of dimensionality. In high-dimensional spaces, most of the volume of a hypersphere is concentrated near its surface rather than its interior. Two randomly chosen high-dimensional vectors from a standard distribution tend to be nearly orthogonal — their dot product is close to zero — even when you have not deliberately constructed them that way. This makes “nearest neighbour” a harder problem than it sounds: there are exponentially many directions to search, and pairwise distances concentrate, so a query’s nearest and farthest neighbours can sit at nearly the same distance.

  • Normalisation changes the geometry. A unit vector has magnitude exactly 1. Dividing a vector by its magnitude — normalisation — projects all vectors onto the surface of the unit hypersphere. On that sphere, distance and angle are equivalent measures of similarity, which simplifies many computations. Embedding models often output unit-normalised vectors precisely to exploit this equivalence.

  • Dimensions are not independent features. When people say a language model embeds words into a 768-dimensional space, they do not mean “dimension 42 encodes the concept of colour.” The axes of an embedding space are rarely interpretable on their own. Meaning is encoded in the relative positions of points — which vectors are close to which others — not in the values along any single axis.
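The normalisation point can be made concrete. This sketch divides a vector by its magnitude and checks that, on the resulting unit vectors, the dot product is exactly the cosine similarity:

```rust
// Project a vector onto the unit hypersphere by dividing by its magnitude.
fn normalize(v: &[f64]) -> Vec<f64> {
    let mag = v.iter().map(|x| x * x).sum::<f64>().sqrt();
    v.iter().map(|x| x / mag).collect()
}

fn main() {
    let u = normalize(&[3.0, 4.0]);
    let mag = u.iter().map(|x| x * x).sum::<f64>().sqrt();
    println!("{:.1}", mag); // unit length: 1.0

    // On unit vectors, the dot product IS the cosine of the angle between them.
    let a = normalize(&[1.0, 1.0]);
    let b = normalize(&[1.0, 0.0]);
    let cos = a.iter().zip(&b).map(|(x, y)| x * y).sum::<f64>();
    println!("{:.3}", cos); // cos 45° ≈ 0.707
}
```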

Vectors as representations. The key insight that makes vector databases useful is that real-world objects — documents, images, audio clips, products, users — can be represented as vectors such that similarity in meaning or content corresponds to proximity in the vector space. Two documents that discuss the same topic will, if embedded well, produce vectors that are close together. Two documents on unrelated topics will produce vectors that are far apart.

This is not magic; it is the result of training a model to produce embeddings where similar inputs cluster near each other. Once you have such a model, every search or comparison problem reduces to a geometric problem: find the vectors closest to a query vector. The rest of this course is about how to do that efficiently at scale.

A note on notation. Throughout this course, vectors are written in bold or with subscripts: v, q, or v₁. The i-th element of a vector v is written v[i] or vᵢ. The magnitude of v is written |v| or ‖v‖. Dimension is written d and the number of stored vectors is written n.


2. Embeddings

Embeddings are the bridge between raw data and vector space. This section covers how language models, image encoders, and other neural networks learn to map heterogeneous inputs — words, sentences, images, products — into vectors where geometric proximity captures semantic similarity. 🚧 Full content tracked in [nbd:584e0c].


3. Vector Similarity

Once you have two vectors, how do you measure how alike they are? This section covers the three most common similarity functions used in vector search: cosine similarity, dot product, and Euclidean distance — their formulas, geometric interpretations, when each is appropriate, and the trade-offs in choosing between them. 🚧 Full content tracked in [nbd:99e1d9].


Part 2 — Vector Databases

4. What Is a Vector Database?

A vector database is a data store built around one core operation: given a query vector q, return the k stored vectors most similar to q. This section covers what that means in practice — approximate nearest-neighbour (ANN) search, the use cases that make vector databases essential (semantic search, recommendations, RAG), and how they differ from traditional relational or key-value databases. 🚧 Full content tracked in [nbd:d9f850].


5. Under the Hood: ANN Algorithms

Exact nearest-neighbour search over millions of high-dimensional vectors is too slow for production use. This section explains the two dominant approximate methods — HNSW (Hierarchical Navigable Small World graphs) and IVFFlat (Inverted File with flat quantisation) — their index construction, query-time traversal, and the recall vs. latency trade-off each exposes. 🚧 Full content tracked in [nbd:6ec5ff].


Part 3 — Turso + sqlite-vec Basics

6. Setting Up

This section walks through everything you need before writing a single SQL query: adding the right crates, opening a local Turso connection, and loading the sqlite-vec extension that gives SQLite vector-search superpowers.

What You Are Building

Turso is a SQLite-compatible database with built-in support for vector similarity search via the sqlite-vec extension. In local development you use a file-backed SQLite database; in production the same code points at a Turso cloud database. The libsql crate (the Rust client for Turso) speaks the Turso wire protocol and also handles local SQLite files transparently.

Cargo.toml

Create a new binary project and add the following dependencies:

cargo new vec-demo
cd vec-demo

Replace the [dependencies] section of Cargo.toml with:

[dependencies]
libsql = "0.9"
tokio = { version = "1", features = ["full"] }

libsql is the official Rust client for Turso / libSQL databases. It supports both local SQLite files and remote Turso connections with the same API, making it straightforward to develop locally and deploy to the cloud. tokio provides the async runtime — all libsql operations are async.

Add the release-build optimisation profile from the project conventions:

[profile.release]
opt-level = "z"
lto = true
strip = true
codegen-units = 1

Opening a Local Connection

Replace src/main.rs with the following:

use libsql::{Builder, Database};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let db: Database = Builder::new_local("vectors.db").build().await?;
    let conn = db.connect()?;

    // Verify the connection works
    let mut rows = conn.query("SELECT sqlite_version()", ()).await?;
    if let Some(row) = rows.next().await? {
        let version: String = row.get(0)?;
        println!("SQLite version: {version}");
    }

    Ok(())
}

Run it with cargo run. You should see output like:

SQLite version: 3.46.0

A file named vectors.db will appear in the current directory. This is a standard SQLite database — you can open it with any SQLite client to inspect its contents.

Enabling Vector Support with sqlite-vec

The libsql crate ships with sqlite-vec built in. No separate installation is required. Vector functions become available automatically once you use the right column types and functions in your SQL.

The key types and functions you will use throughout this course:

  • F32_BLOB(d) — column type for storing a d-dimensional float32 vector
  • vector(json_array) — creates a vector from a JSON array literal
  • vector_extract(blob) — converts a stored vector blob back to a JSON array
  • vector_distance_cos(a, b) — cosine distance between two vectors (0 = identical, 2 = opposite)
  • libsql_vector_idx(col) — marks a column for fast approximate nearest-neighbour indexing
  • vector_top_k(idx_name, query, k) — table-valued function: returns the rowids of the k stored rows nearest to the query vector, via the named index

Creating a Vector Table

Extend main to create a table that stores 3-dimensional float32 vectors:

conn.execute(
    "CREATE TABLE IF NOT EXISTS items (
         id        INTEGER PRIMARY KEY,
         label     TEXT NOT NULL,
         embedding F32_BLOB(3) NOT NULL
     )",
    (),
).await?;

F32_BLOB(3) declares a column that holds a 3-dimensional float32 vector stored as a binary blob. The 3 is the dimensionality — use the actual size of your embedding model’s output (e.g., F32_BLOB(768) for a 768-dimensional model) in real projects.

Creating a Vector Index

Without an index, nearest-neighbour search performs a full table scan — computing the distance from the query to every stored vector. For small tables this is fine; at scale you need an index:

conn.execute(
    "CREATE INDEX IF NOT EXISTS items_vec_idx
         ON items (libsql_vector_idx(embedding))",
    (),
).await?;

This creates an approximate nearest-neighbour vector index over the embedding column. Queries that use vector_top_k will automatically use this index. The index is updated incrementally as rows are inserted or deleted — no manual rebuild is required.
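As a usage sketch, here is how inserting rows and querying neighbours through the index might look, built from the functions in the table above. The exact call shapes — in particular vector_top_k taking the index name and exposing a rowid column named id — are assumptions to verify against Turso's documentation:

```sql
-- Hypothetical sketch: three labelled 3-dimensional vectors.
INSERT INTO items (label, embedding) VALUES
  ('x-axis', vector('[1.0, 0.0, 0.0]')),
  ('y-axis', vector('[0.0, 1.0, 0.0]')),
  ('z-axis', vector('[0.0, 0.0, 1.0]'));

-- The 2 nearest neighbours of [0.9, 0.1, 0.0], ranked by cosine distance.
SELECT i.label,
       vector_distance_cos(i.embedding, vector('[0.9, 0.1, 0.0]')) AS dist
FROM vector_top_k('items_vec_idx', vector('[0.9, 0.1, 0.0]'), 2) AS t
JOIN items i ON i.rowid = t.id
ORDER BY dist;
```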

Putting It Together

At this point your main.rs should look like this:

use libsql::{Builder, Database};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let db: Database = Builder::new_local("vectors.db").build().await?;
    let conn = db.connect()?;

    // Verify connection
    let mut rows = conn.query("SELECT sqlite_version()", ()).await?;
    if let Some(row) = rows.next().await? {
        let version: String = row.get(0)?;
        println!("SQLite version: {version}");
    }

    // Create vector table
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items (
             id        INTEGER PRIMARY KEY,
             label     TEXT NOT NULL,
             embedding F32_BLOB(3) NOT NULL
         )",
        (),
    ).await?;

    // Create vector index
    conn.execute(
        "CREATE INDEX IF NOT EXISTS items_vec_idx
             ON items (libsql_vector_idx(embedding))",
        (),
    ).await?;

    println!("Database ready.");
    Ok(())
}

cargo run should print:

SQLite version: 3.46.0
Database ready.

You now have a working local vector database. Exercises 1 through 5 build on this foundation, adding data, querying it, and connecting the full embedding-to-search pipeline.


7. Exercise 1 — Storing and Retrieving Vectors

Goal: Insert a small set of labelled vectors into the items table created in §6, then retrieve them with a SELECT and deserialize the stored blob back into a Rust Vec<f32>. 🚧 Full content tracked in [nbd:081a55].


8. Exercise 2 — K-Nearest Neighbor Search

Goal: Use vector_top_k and vector_distance_cos to find the k vectors in the database most similar to a query vector, and display the results ranked by similarity score. 🚧 Full content tracked in [nbd:5674ce].


Part 4 — Real Applications

9. Generating Embeddings in Rust

Before you can search by meaning, you need a way to convert text into vectors. This section covers two approaches available in Rust: running a local embedding model with fastembed-rs (no API key, works offline, suited for smaller models) and calling an HTTP embedding API such as the OpenAI Embeddings endpoint (larger, higher-quality models at the cost of latency and a network dependency). 🚧 Full content tracked in [nbd:4c961f].


10. Exercise 3 — Semantic Document Search

Goal: Build a complete semantic search pipeline: embed a small corpus of text documents, store the embeddings in Turso, then accept a natural-language query, embed it, and return the top-k most relevant documents using vector similarity — all without any keyword matching. 🚧 Full content tracked in [nbd:1ef9f4].


11. Exercise 4 — Recommendation Engine

Goal: Implement item-based collaborative filtering using vector similarity. Store item feature vectors (or learned item embeddings) in Turso, then given a target item, retrieve the k most similar items as recommendations. 🚧 Full content tracked in [nbd:e8be9a].


12. Exercise 5 — Retrieval-Augmented Generation

Goal: Combine vector search with a language model to build a retrieval-augmented generation (RAG) pipeline: given a user question, retrieve the most relevant passages from a document store using semantic search, inject them into a prompt as context, and stream the language model’s grounded answer back to the user. 🚧 Full content tracked in [nbd:5ed295].

Git Worktrees

What Are They?

A git worktree is a checked-out copy of a branch that lives in a separate directory on your filesystem, but shares the same underlying git repository (.git database) as your main clone.

Normally, a single git clone has one working directory — the files you see and edit. A worktree lets you have multiple working directories at once, each on a different branch, all tied to the same repository. You don’t clone the repo a second time; you project a second (or third, or fourth) working directory from the one clone.

~/projects/myapp/.git/     ← shared: all objects, refs, config
~/projects/myapp/          ← main worktree (branch: main)
~/projects/myapp-feature/  ← linked worktree (branch: feature/auth)
~/projects/myapp-hotfix/   ← linked worktree (branch: hotfix/crash)

All three directories above read and write to the same .git object store. Commits made in any worktree are immediately visible to all others.


Core Commands

Add a worktree

# Check out an existing branch into a new directory
git worktree add ../myapp-feature feature/auth

# Create a new branch and check it out in one step
git worktree add -b feature/new-thing ../myapp-new-thing main

The first argument is the path for the new directory. The second is the branch name (or starting commit).

List worktrees

git worktree list

Example output:

/home/you/projects/myapp          abc1234 [main]
/home/you/projects/myapp-feature  def5678 [feature/auth]

Remove a worktree

# Remove the directory and deregister the worktree
git worktree remove ../myapp-feature

# Force-remove even if there are untracked or modified files
git worktree remove --force ../myapp-feature

After deleting a linked worktree directory manually (e.g. with rm -rf), git still knows about it. Clean up the stale reference with:

git worktree prune

Move a worktree

git worktree move ../myapp-feature ../myapp-auth-feature

Typical Workflows

Work on two branches simultaneously

You’re deep in a feature branch when an urgent bug report comes in. Instead of stashing your work or committing a WIP, you add a worktree:

git worktree add ../myapp-hotfix main
cd ../myapp-hotfix
# fix the bug, commit, push — then come back to your feature

Your feature working directory is untouched.

Run tests against a different branch without switching

git worktree add /tmp/myapp-test origin/release/2.0
cd /tmp/myapp-test
cargo test

You keep your editor open on main while tests run elsewhere.

Review a colleague’s PR locally

git fetch origin pull/123/head:pr-123   # GitHub's head ref for PR #123
git worktree add ../review-pr-123 pr-123
cd ../review-pr-123
# read, run, evaluate — then remove when done
git worktree remove ../review-pr-123

How It Works Internally

When you run git worktree add, git:

  1. Creates a subdirectory inside .git/worktrees/<name>/ to store the worktree’s private state (its HEAD, index, and a back-reference to the working directory path).
  2. Writes a .git file (not a directory) into the new working directory that points back to .git/worktrees/<name>/.
  3. Checks out the requested branch into the new directory.

The object database (all commits, trees, blobs) is shared. Only the index and HEAD are per-worktree.
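You can observe all of this in a throwaway repository. The sketch below assumes git is installed and creates everything under a temporary directory:

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q repo
cd repo
git -c user.email=a@example.com -c user.name=tmp commit -q --allow-empty -m init
git worktree add -b feature ../wt >/dev/null

# In the linked worktree, .git is a FILE pointing back at the shared store:
cat ../wt/.git              # gitdir: .../repo/.git/worktrees/wt

# The per-worktree private state lives under .git/worktrees/<name>/:
ls .git/worktrees/wt        # HEAD, index, commondir, gitdir, ...
```

The cat output is a single gitdir: line; deleting the linked directory by hand is what leaves behind the stale state under .git/worktrees/ that git worktree prune cleans up.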


Limitations and When to Avoid Them

Each branch can only be checked out once

The most important constraint: a single branch cannot be checked out in two worktrees at the same time. Git enforces this so that two working directories cannot silently move the same branch ref out from under each other. If you try, you’ll get an error:

fatal: 'feature/auth' is already checked out at '/home/you/projects/myapp'

Workaround: check out the second worktree at a specific commit (detached HEAD), or create a local tracking branch.

Tools that look for .git as a directory can break

Some tools assume .git is a directory, not a file. A linked worktree has a .git file instead, which confuses:

  • Older git GUIs that scan for .git/ directories
  • Some shell prompts and plugins that detect git repos naively
  • Scripts that do test -d .git

Build systems that use the working directory path

If your build system or language tooling caches absolute paths (e.g. Cargo’s target/ directory is placed relative to the workspace root), separate worktrees work fine — each has its own target/ directory. But if a tool bakes the source path into cached artifacts, switching between worktrees may cause spurious rebuilds or cache invalidation.

Hooks run in the worktree context, not the main repo

Git hooks in .git/hooks/ apply to all worktrees. But the working directory ($GIT_WORK_TREE) will be the linked worktree path, not the main repo path. Hook scripts that assume a fixed working directory can behave unexpectedly.

Not a substitute for branches in CI

Worktrees are a local developer convenience. Don’t model CI pipelines around them — use proper branch checkouts in ephemeral CI environments instead.

Submodules require extra care

Submodules are not automatically initialized in a new worktree. You need to run git submodule update --init inside each linked worktree where you need them.

Avoid for long-lived parallel development

If you find yourself maintaining two worktrees of the same repo for weeks at a time, that’s a signal you might want two separate clones or a proper branch strategy — not worktrees. Worktrees shine for short-lived parallel work (hotfixes, reviews, quick tests), not as a permanent parallel development setup.


Quick Reference

  • Add a worktree for an existing branch — git worktree add <path> <branch>
  • Add a worktree and create a new branch — git worktree add -b <branch> <path> <start-point>
  • List all worktrees — git worktree list
  • Remove a worktree — git worktree remove <path>
  • Clean up stale worktree records — git worktree prune
  • Move a worktree — git worktree move <old-path> <new-path>

Writing a Lisp-to-C Compiler in Rust

This course walks you through building a complete, working compiler from scratch. You will write every component yourself — a lexer, a parser, a semantic analyser, and a code generator — ending with a program that reads MiniLisp source code and emits valid C. The compiler is written in Rust and uses the nom parser-combinator library for all parsing work.


Table of Contents

Part 1 — Foundations

  1. Introduction: What We’re Building
  2. MiniLisp Language Specification
  3. Compiler Architecture: The Pipeline

Part 2 — Parsing with nom

  1. Introduction to nom: Parser Combinators
  2. Setting Up the Project
  3. Recognizing Atoms: Integers, Booleans, Strings, Symbols
  4. The Abstract Syntax Tree
  5. Parsing Atoms with nom
  6. Parsing S-Expressions and Special Forms

Part 3 — Semantic Analysis

  1. Symbol Tables and Scope
  2. Checking Special Forms

Part 4 — Code Generation

  1. The C Runtime Preamble
  2. Generating C: Atoms and Expressions
  3. Generating C: Definitions and Functions
  4. Generating C: Control Flow and Sequencing

Part 5 — Putting It Together

  1. The Compilation Pipeline
  2. Testing the Compiler
  3. What’s Next: Extensions and Further Reading

Part 1 — Foundations

1. Introduction: What We’re Building

A compiler is a program that transforms source code written in one language into equivalent code in another. By the end of this course you will have written one that accepts MiniLisp — a small, clean dialect of Lisp — and produces human-readable C that you can compile and run with any standard C compiler. Along the way you will implement each classic compiler stage from scratch: lexical analysis, parsing, semantic analysis, and code generation.

Why Lisp-to-C?

Lisp is the ideal source language for a first compiler because its syntax is almost trivial to parse. Every program is either an atom (a number, string, boolean, or name) or a list of expressions wrapped in parentheses. There are no operator-precedence rules, no statement-versus-expression distinctions, and no ambiguous grammars. This lets you focus on what a compiler does rather than fighting with syntax.

C is the ideal target language because it is close to the machine — you can see exactly how each Lisp construct translates to something concrete — yet still readable. You will not be writing assembly; the C compiler handles that final step.

What you will learn

By completing this course you will understand:

  • Lexical analysis and parsing — how raw text becomes structured data.
  • Abstract syntax trees — how compilers represent programs internally.
  • Semantic analysis — how a compiler validates that a program makes sense before generating code.
  • Code generation — how high-level constructs map to lower-level constructs in a target language.
  • nom parser combinators — a practical, idiomatic Rust approach to parsing.

What you will build

The finished compiler is a single Rust binary. You pipe MiniLisp source in and get C source out:

$ echo '(define (square x) (* x x)) (display (square 5))' | minilisp-cc > out.c
$ cc -o out out.c && ./out
25

The compiler handles integers, booleans, strings, arithmetic, comparisons, conditionals, local bindings, sequencing, first-class named functions, and recursion. It is deliberately minimal — roughly 800 lines of Rust — so you can read and understand every line.

Prerequisites

You should be comfortable reading and writing Rust. You do not need prior experience with compilers, Lisp, or nom — this course introduces all three from scratch. You should have a working Rust toolchain (rustc, cargo) and a C compiler (cc or gcc) installed.

Course structure

The course is divided into five parts:

  1. Foundations — what MiniLisp looks like, what the compiler’s architecture is.
  2. Parsing with nom — turning source text into an AST.
  3. Semantic Analysis — validating the AST.
  4. Code Generation — turning the AST into C.
  5. Putting It Together — wiring the stages into a working tool and testing it.

Each section builds directly on the previous one. Code samples are cumulative: by the end, every snippet fits together into the complete compiler.


2. MiniLisp Language Specification

MiniLisp is the source language of our compiler. It is a minimal Lisp dialect with integers, booleans, strings, first-class functions, lexical scope, and a small set of built-in operators. This section defines every syntactic form precisely, gives the grammar in EBNF, and shows a complete example program so you know exactly what the compiler must handle before you write a single line of Rust.

Data types

MiniLisp has four data types:

  • Integer — 42, -7, 0 — signed 64-bit integers
  • Boolean — #t, #f — true and false
  • String — "hello", "line\n" — double-quoted, with \\, \", \n, \t escapes
  • Function — (lambda (x) (* x x)) — first-class, lexically scoped

There are no floating-point numbers, characters, lists-as-data, or nil/null values. This keeps the compiler simple while still being expressive enough to write interesting programs.

Expressions

Everything in MiniLisp is an expression — there are no statements. Every form evaluates to a value.

Atoms are the simplest expressions:

42          ; integer
#t          ; boolean true
#f          ; boolean false
"hello"     ; string
x           ; symbol (variable reference)

Function calls use prefix notation inside parentheses. The first element is the operator or function; the rest are arguments:

(+ 1 2)           ; => 3
(* 3 (+ 1 2))     ; => 9
(display "hi")     ; prints "hi", returns void

Built-in operators:

  • +, -, *, / — arity 2 — arithmetic (integer division for /)
  • =, <, >, <=, >= — arity 2 — comparison (returns #t or #f)
  • not — arity 1 — boolean negation
  • and, or — arity 2 — boolean connectives (not short-circuiting)
  • display — arity 1 — print a value to stdout
  • newline — arity 0 — print a newline character

Special forms

Special forms look like function calls but have special evaluation rules. The compiler recognises them by their leading keyword.

define — top-level definitions

;; Define a variable
(define pi 3)

;; Define a function (syntactic sugar)
(define (square x) (* x x))

;; The function form is equivalent to:
(define square (lambda (x) (* x x)))

if — conditional

(if (> x 0) x (- 0 x))   ; absolute value

Always three sub-expressions: condition, then-branch, else-branch.

lambda — anonymous function

(lambda (x y) (+ x y))

Parameter list followed by one or more body expressions. The value of the last body expression is the return value.

let — local bindings

(let ((x 10)
      (y 20))
  (+ x y))

A list of (name value) bindings followed by one or more body expressions. Bindings are not mutually visible (this is let, not let*).

begin — sequencing

(begin
  (display "hello")
  (newline)
  42)

Evaluates each expression in order, returns the value of the last one.

Comments

A semicolon ; starts a comment that extends to the end of the line:

(define x 10)  ; this is a comment

Grammar (EBNF)

program    = { expr } ;
expr       = atom | list ;
atom       = integer | boolean | string | symbol ;
integer    = [ "-" ] digit { digit } ;
boolean    = "#t" | "#f" ;
string     = '"' { string_char } '"' ;
string_char = escape | any character except '"' and '\\' ;
escape     = '\\' ( '"' | '\\' | 'n' | 't' ) ;
symbol     = symbol_start { symbol_cont } ;
symbol_start = letter | '_' | '+' | '-' | '*' | '/' | '='
             | '<' | '>' | '!' | '?' ;
symbol_cont  = symbol_start | digit ;
list       = '(' { ws } { expr { ws } } ')' ;
ws         = ' ' | '\t' | '\n' | '\r' | comment ;
comment    = ';' { any character except '\n' } '\n' ;

Complete example program

;; Recursive factorial
(define (fact n)
  (if (= n 0)
    1
    (* n (fact (- n 1)))))

;; Compute and display 10!
(display (fact 10))
(newline)

This program defines a recursive factorial function, computes 10!, and prints 3628800 followed by a newline.


3. Compiler Architecture: The Pipeline

Our compiler is a classic multi-stage pipeline: source text passes through a parser, producing an AST; the AST passes through a semantic analyser, which validates scope and form usage; the validated AST passes through a code generator, which emits C. This section maps that pipeline onto the module structure you will build and explains how data and errors flow between stages.

The four stages

  MiniLisp source text
         │
         ▼
  ┌─────────────┐
  │   Parser     │  (nom combinators)
  │  src/parser  │
  └──────┬──────┘
         │  Vec<Expr>
         ▼
  ┌─────────────┐
  │  Analyser    │  (scope checking, form validation)
  │ src/analysis │
  └──────┬──────┘
         │  Vec<Expr>  (same type, now validated)
         ▼
  ┌─────────────┐
  │  Code Gen    │  (AST → C strings)
  │ src/codegen  │
  └──────┬──────┘
         │  String
         ▼
    C source code

  1. Parser — converts raw text into a Vec<Expr>, where Expr is the AST node type. Uses nom parser combinators. Reports syntax errors with position information.

  2. Semantic Analyser — walks the AST, builds a symbol table to track variable scopes, and validates that special forms have the correct shape (e.g. if has exactly three sub-expressions). Reports semantic errors. The analyser does not transform the AST — it only checks it.

  3. Code Generator — recursively traverses the validated AST and produces a String containing the complete C program. Emits the runtime preamble first, then forward declarations, then function definitions, then main.

  4. Driver — the main function wires the stages together: read input, parse, analyse, generate, write output. It also handles CLI arguments and error display.

Module layout

minilisp-cc/
├── Cargo.toml
└── src/
    ├── main.rs       # CLI entry point, pipeline driver
    ├── ast.rs        # Expr enum and Display impl
    ├── parser.rs     # nom parsers
    ├── analysis.rs   # symbol table, scope checking, form validation
    └── codegen.rs    # C code generation

Each module has a single, clear responsibility. Data flows in one direction: main calls parser, passes the result to analysis, passes that to codegen, and writes the output.

Data flow

The key data type is Expr — the AST node. It is defined once in ast.rs and used by all stages. The parser produces Vec<Expr>. The analyser consumes Vec<Expr> and returns Vec<Expr> (unchanged, but validated). The code generator consumes Vec<Expr> and returns String.
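To make the data flow concrete, here is one plausible shape for Expr. It is illustrative only; the course defines its own version in ast.rs later:

```rust
// A plausible AST node for MiniLisp: every program is an atom or a list.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Integer(i64),
    Boolean(bool),
    Str(String),
    Symbol(String),
    List(Vec<Expr>),
}

fn main() {
    // The source text "(+ 1 2)" would parse into this tree:
    let ast = Expr::List(vec![
        Expr::Symbol("+".into()),
        Expr::Integer(1),
        Expr::Integer(2),
    ]);
    println!("{:?}", ast);
}
```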

Error handling

Each stage defines its own error type:

/// A compiler error with a human-readable message.
#[derive(Debug)]
pub struct CompileError {
    pub message: String,
}

In a production compiler you would track source positions (line and column) in every AST node. For simplicity, our compiler attaches position information only in error messages from the parser (nom provides this automatically) and uses descriptive messages for semantic errors.

The driver collects errors from each stage. If parsing fails, the compiler stops. If analysis finds errors, it reports all of them before stopping (so you see every problem at once, not one at a time). Code generation assumes a valid AST and does not produce errors.

Why this architecture?

Separating the compiler into independent stages has practical benefits:

  • Testability — you can unit-test each stage in isolation. Feed a string to the parser and assert on the AST. Feed an AST to the analyser and assert on the errors. Feed a valid AST to the code generator and assert on the C output.
  • Debuggability — when something goes wrong, you know which stage to look at.
  • Extensibility — adding a new language feature means adding an AST variant, a parser case, an analyser check, and a code generation case. Each change is localised to one module.

Part 2 — Parsing with nom

4. Introduction to nom: Parser Combinators

nom is a parser-combinator library: instead of writing a grammar file and running a generator, you write small Rust functions that each recognise a fragment of input, then combine them into larger parsers. This section introduces the core IResult<I, O, E> type, walks through the essential combinators (tag, char, alt, many0, map, tuple, delimited, preceded), and shows how to write, compose, and test parsers before you apply any of this to MiniLisp.

What is a parser combinator?

A parser is a function that takes some input and either succeeds (returning the parsed value and the remaining input) or fails (returning an error). A combinator is a function that takes one or more parsers and returns a new parser. By combining small parsers you build complex ones without writing a single regular expression or grammar file.
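The idea fits in a few lines of plain Rust before any library is involved. In this sketch, `literal` and `then` are illustrative names, not nom's:

```rust
// A parser: a function from input to Result<(remaining_input, value), error>.
type PResult<'a, T> = Result<(&'a str, T), String>;

// A primitive parser: match an exact literal at the front of the input.
fn literal<'a>(expected: &'static str) -> impl Fn(&'a str) -> PResult<'a, &'static str> {
    move |input| match input.strip_prefix(expected) {
        Some(rest) => Ok((rest, expected)),
        None => Err(format!("expected {:?}", expected)),
    }
}

// A combinator: take two parsers, return a parser that runs them in sequence.
fn then<'a, A, B>(
    first: impl Fn(&'a str) -> PResult<'a, A>,
    second: impl Fn(&'a str) -> PResult<'a, B>,
) -> impl Fn(&'a str) -> PResult<'a, (A, B)> {
    move |input| {
        let (rest, a) = first(input)?;
        let (rest, b) = second(rest)?;
        Ok((rest, (a, b)))
    }
}
```

nom's combinators are this same idea with better error types, more primitives, and careful performance work.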

In nom, every parser has this signature:

#![allow(unused)]
fn main() {
fn parser(input: &str) -> IResult<&str, Output>
}

IResult<I, O> is defined roughly as:

#![allow(unused)]
fn main() {
type IResult<I, O> = Result<(I, O), Err<Error<I>>>;
}

On success, you get a tuple of (remaining_input, parsed_value). On failure, you get an error indicating where parsing failed.

Essential combinators

Here are the combinators you will use most. Each is a function in the nom crate.

tag — match a literal string

#![allow(unused)]
fn main() {
use nom::IResult;
use nom::bytes::complete::tag;

fn parse_hello(input: &str) -> IResult<&str, &str> {
    tag("hello")(input)
}

assert_eq!(parse_hello("hello world"), Ok((" world", "hello")));
}

char — match a single character

#![allow(unused)]
fn main() {
use nom::character::complete::char;

// Matches the character '('
let result = char('(')("(abc)");
assert_eq!(result, Ok(("abc)", '(')));
}

alt — try alternatives in order

#![allow(unused)]
fn main() {
use nom::IResult;
use nom::branch::alt;
use nom::bytes::complete::tag;

fn parse_bool(input: &str) -> IResult<&str, &str> {
    alt((tag("#t"), tag("#f")))(input)
}

assert_eq!(parse_bool("#t rest"), Ok((" rest", "#t")));
assert_eq!(parse_bool("#f rest"), Ok((" rest", "#f")));
}

alt tries each parser in order and returns the result of the first one that succeeds.

map — transform a parser’s output

#![allow(unused)]
fn main() {
use nom::IResult;
use nom::combinator::map;
use nom::bytes::complete::tag;

fn parse_true(input: &str) -> IResult<&str, bool> {
    map(tag("#t"), |_| true)(input)
}

assert_eq!(parse_true("#t"), Ok(("", true)));
}

tuple — sequence multiple parsers

#![allow(unused)]
fn main() {
use nom::IResult;
use nom::sequence::tuple;
use nom::character::complete::char;
use nom::bytes::complete::tag;

fn parse_pair(input: &str) -> IResult<&str, (char, &str, char)> {
    tuple((char('('), tag("hi"), char(')')))(input)
}

assert_eq!(parse_pair("(hi)"), Ok(("", ('(', "hi", ')'))));
}

delimited — match something between two delimiters

#![allow(unused)]
fn main() {
use nom::IResult;
use nom::sequence::delimited;
use nom::character::complete::char;
use nom::bytes::complete::tag;

fn parse_parens(input: &str) -> IResult<&str, &str> {
    delimited(char('('), tag("hi"), char(')'))(input)
}

assert_eq!(parse_parens("(hi)"), Ok(("", "hi")));
}

delimited discards the opening and closing delimiters and returns only the inner value.

preceded — match a prefix and discard it

#![allow(unused)]
fn main() {
use nom::IResult;
use nom::sequence::preceded;
use nom::character::complete::char;
use nom::bytes::complete::tag;

fn parse_hash_t(input: &str) -> IResult<&str, &str> {
    preceded(char('#'), tag("t"))(input)
}

assert_eq!(parse_hash_t("#t"), Ok(("", "t")));
}

many0 — match zero or more repetitions

#![allow(unused)]
fn main() {
use nom::IResult;
use nom::multi::many0;
use nom::character::complete::char;

fn parse_stars(input: &str) -> IResult<&str, Vec<char>> {
    many0(char('*'))(input)
}

assert_eq!(parse_stars("***abc"), Ok(("abc", vec!['*', '*', '*'])));
assert_eq!(parse_stars("abc"), Ok(("abc", vec![])));
}

Composing parsers

The power of combinators is composition. Here is a parser that recognises a parenthesised, comma-separated pair of integers:

#![allow(unused)]
fn main() {
use nom::IResult;
use nom::character::complete::{char, digit1};
use nom::combinator::map_res;
use nom::sequence::{delimited, separated_pair};

fn parse_int(input: &str) -> IResult<&str, i64> {
    map_res(digit1, |s: &str| s.parse::<i64>())(input)
}

fn parse_pair(input: &str) -> IResult<&str, (i64, i64)> {
    delimited(
        char('('),
        separated_pair(parse_int, char(','), parse_int),
        char(')'),
    )(input)
}

assert_eq!(parse_pair("(3,7)"), Ok(("", (3, 7))));
}

Each piece is a small, testable function. You build up complexity by snapping pieces together.

Testing nom parsers

Testing a nom parser is straightforward: call the function with a test input and assert on the result.

#![allow(unused)]
fn main() {
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_parse_int() {
        assert_eq!(parse_int("42 rest"), Ok((" rest", 42)));
    }

    #[test]
    fn test_parse_int_fails_on_alpha() {
        assert!(parse_int("abc").is_err());
    }
}
}

Always test both the happy path and failure cases. Verify that the remaining input is what you expect — a parser that accidentally consumes too much or too little is a common bug.

Exercises

  1. Write a parser that recognises the string "true" and returns the Rust bool value true.
  2. Write a parser that recognises either "yes" or "no" and returns a bool.
  3. Write a parser for a parenthesised integer like (42) that returns the i64 value inside.

5. Setting Up the Project

You will create a new Rust binary crate for the compiler, add nom and any other dependencies to Cargo.toml, and lay out the module structure that the rest of the course fills in. By the end of this section you will have a project that compiles, a src/main.rs that reads from stdin, and placeholder modules for each compiler stage.

Create the crate

cargo new minilisp-cc
cd minilisp-cc

This gives you a standard Rust binary project with src/main.rs and Cargo.toml.

Add dependencies

Edit Cargo.toml:

[package]
name = "minilisp-cc"
version = "0.1.0"
edition = "2021"

[dependencies]
nom = "7"

[profile.release]
opt-level = "z"
lto = true
strip = true
codegen-units = 1

This course targets nom version 7; the combinator API shown throughout matches that major version. The release profile settings minimise the size of the compiled binary.

Create the module files

Create one file per compiler stage:

touch src/ast.rs src/parser.rs src/analysis.rs src/codegen.rs

Your src/ directory should now look like this:

src/
├── main.rs
├── ast.rs
├── parser.rs
├── analysis.rs
└── codegen.rs

Wire up modules in main.rs

Replace src/main.rs with:

mod ast;
mod parser;
mod analysis;
mod codegen;

use std::io::{self, Read};

fn main() {
    // Read all of stdin into a string
    let mut input = String::new();
    io::stdin()
        .read_to_string(&mut input)
        .expect("failed to read stdin");

    // TODO: parse, analyse, generate
    println!("Read {} bytes", input.len());
}

Add placeholder content to each module

In src/ast.rs:

/// A MiniLisp expression — the core AST node type.
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    // Variants will be added in §7
}

In src/parser.rs:

use crate::ast::Expr;

/// Parse MiniLisp source code into a list of expressions.
pub fn parse(_input: &str) -> Result<Vec<Expr>, String> {
    todo!("implement in §8–§9")
}

In src/analysis.rs:

use crate::ast::Expr;

/// Validate a list of parsed expressions.
pub fn analyse(_exprs: &[Expr]) -> Result<(), Vec<String>> {
    todo!("implement in §10–§11")
}

In src/codegen.rs:

use crate::ast::Expr;

/// Generate C source code from a list of validated expressions.
pub fn generate(_exprs: &[Expr]) -> String {
    todo!("implement in §12–§15")
}

Verify it compiles

cargo check

You should see a clean compilation with no errors. There may be warnings about dead code, because nothing calls the placeholder functions yet — that is fine. They will disappear as you fill in the implementations and wire the stages together in main.

Checkpoint

You now have a project skeleton with:

  • A binary crate that reads from stdin.
  • Four modules with clear responsibilities and placeholder signatures.
  • nom as a dependency, ready to use.

The rest of the course fills in these modules one at a time.


6. Recognizing Atoms: Integers, Booleans, Strings, Symbols

Before building the full parser, you need nom parsers for each atomic value in MiniLisp: signed integers, boolean literals #t and #f, double-quoted strings with escape sequences, and symbol identifiers. This section develops each atom parser in isolation, explains the nom combinators used, and provides exercises to test your understanding before the parts are assembled into the full parser.

Parsing integers

A MiniLisp integer is an optional minus sign followed by one or more digits. We use recognize to capture the matched text as a single slice, then parse it into an i64:

#![allow(unused)]
fn main() {
use nom::IResult;
use nom::branch::alt;
use nom::character::complete::{char, digit1};
use nom::combinator::{map_res, opt, recognize};
use nom::sequence::pair;

/// Parse a signed integer literal.
fn parse_integer(input: &str) -> IResult<&str, i64> {
    map_res(
        recognize(pair(opt(char('-')), digit1)),
        |s: &str| s.parse::<i64>(),
    )(input)
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn positive_integer() {
        assert_eq!(parse_integer("42 rest"), Ok((" rest", 42)));
    }

    #[test]
    fn negative_integer() {
        assert_eq!(parse_integer("-7"), Ok(("", -7)));
    }

    #[test]
    fn zero() {
        assert_eq!(parse_integer("0"), Ok(("", 0)));
    }
}
}

How it works:

  • opt(char('-')) optionally matches a leading minus sign.
  • digit1 matches one or more ASCII digits.
  • pair sequences them so we match -42 as a unit.
  • recognize captures the entire matched span as a &str instead of the pair tuple.
  • map_res applies str::parse::<i64>() to convert the string to an integer, propagating parse errors as nom errors.

Parsing booleans

Booleans are the literals #t (true) and #f (false):

#![allow(unused)]
fn main() {
use nom::bytes::complete::tag;
use nom::combinator::map;

/// Parse a boolean literal: `#t` or `#f`.
fn parse_boolean(input: &str) -> IResult<&str, bool> {
    alt((
        map(tag("#t"), |_| true),
        map(tag("#f"), |_| false),
    ))(input)
}

#[test]
fn booleans() {
    assert_eq!(parse_boolean("#t"), Ok(("", true)));
    assert_eq!(parse_boolean("#f rest"), Ok((" rest", false)));
}
}

How it works:

  • tag("#t") matches the literal string #t.
  • map transforms the matched string into the Rust bool value true.
  • alt tries #t first, then #f.

Parsing strings

Strings are enclosed in double quotes and support escape sequences \\, \", \n, and \t:

#![allow(unused)]
fn main() {
use nom::bytes::complete::{tag, take_while1};
use nom::character::complete::char;
use nom::multi::many0;
use nom::sequence::delimited;

/// Parse a single character inside a string (handling escapes).
fn string_char(input: &str) -> IResult<&str, char> {
    alt((
        // Escape sequences
        map(tag("\\\\"), |_| '\\'),
        map(tag("\\\""), |_| '"'),
        map(tag("\\n"), |_| '\n'),
        map(tag("\\t"), |_| '\t'),
        // Any character that is not a quote or backslash
        nom::character::complete::none_of("\"\\"),
    ))(input)
}

/// Parse a double-quoted string literal.
fn parse_string(input: &str) -> IResult<&str, String> {
    delimited(
        char('"'),
        map(many0(string_char), |chars| chars.into_iter().collect()),
        char('"'),
    )(input)
}

#[test]
fn simple_string() {
    assert_eq!(parse_string("\"hello\""), Ok(("", "hello".to_string())));
}

#[test]
fn string_with_escapes() {
    assert_eq!(
        parse_string("\"line\\none\""),
        Ok(("", "line\none".to_string()))
    );
}

#[test]
fn empty_string() {
    assert_eq!(parse_string("\"\""), Ok(("", String::new())));
}
}

How it works:

  • string_char matches either an escape sequence or a literal character.
  • Escape sequences are matched by the explicit tag branches. none_of("\"\\") deliberately excludes both the quote and the backslash, so a backslash can only ever be consumed as the start of a recognised escape sequence.
  • many0(string_char) collects zero or more characters into a Vec<char>.
  • map converts the Vec<char> into a String.
  • delimited strips the surrounding double quotes.

Parsing symbols

A symbol is an identifier that starts with a letter, underscore, or one of the operator characters (+, -, *, /, =, <, >, !, ?), followed by zero or more of those characters or digits:

#![allow(unused)]
fn main() {
use nom::combinator::recognize;
use nom::multi::many0_count;
use nom::character::complete::one_of;

/// Match a character that can start a symbol: a letter, underscore,
/// or one of the operator characters. Digits are deliberately excluded.
fn symbol_start(input: &str) -> IResult<&str, char> {
    one_of("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_+-*/=<>!?")(input)
}

/// Parse a symbol identifier.
fn parse_symbol(input: &str) -> IResult<&str, String> {
    map(
        recognize(pair(
            symbol_start,
            many0_count(one_of(
                "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_+-*/=<>!?",
            )),
        )),
        String::from,
    )(input)
}

#[test]
fn simple_symbol() {
    assert_eq!(parse_symbol("foo"), Ok(("", "foo".to_string())));
}

#[test]
fn operator_symbol() {
    assert_eq!(parse_symbol("+ rest"), Ok((" rest", "+".to_string())));
}

#[test]
fn hyphenated_symbol() {
    assert_eq!(
        parse_symbol("my-var"),
        Ok(("", "my-var".to_string()))
    );
}
}

How it works:

  • one_of(...) matches a single character from the given set.
  • symbol_start restricts the leading character to letters, underscores, and operator characters, so a symbol cannot begin with a digit.
  • many0_count matches zero or more continuation characters but only counts them (we use recognize to capture the whole span).
  • recognize gives us the entire matched text as a &str, which we convert to a String.

Important ordering note

When combining these parsers later, the order matters. parse_integer and parse_symbol can both start with -. The rule is: try parse_integer first, and only fall back to parse_symbol if it fails. We handle this with alt in the next section.

Exercises

  1. Extend parse_string to support the escape sequence \0 (null character).
  2. Write tests for edge cases: the symbol - (just a minus sign, a valid symbol), the integer -0, and the string "\\\"" (a string containing a backslash followed by a quote).
  3. What happens if you try parse_integer on the input "abc"? Read the nom error and explain what it means.

7. The Abstract Syntax Tree

The parser’s output is an Abstract Syntax Tree — a Rust data structure that captures the meaning of a MiniLisp program without the syntactic noise of parentheses and whitespace. This section defines the Expr enum and its variants, discusses why the tree is structured the way it is, and implements Display so you can inspect parse results during development.

What is an AST?

An Abstract Syntax Tree strips away syntactic details — parentheses, whitespace, comments — and represents only the structural meaning of a program. Two MiniLisp programs that differ only in whitespace or comments produce identical ASTs. The AST is the compiler’s internal representation from the parser onwards; every subsequent stage operates on it.

Designing the Expr enum

There are two common approaches for representing Lisp in an AST:

Option A — Generic lists. Everything is either an atom or a List(Vec<Expr>). Special forms like if and define are not distinguished during parsing; they are recognised later during semantic analysis.

Option B — Specific variants. Each special form gets its own enum variant with typed fields. The parser does more work, but later stages deal with well-typed data.

We use Option B. The benefit is exhaustive pattern matching: if you add a new variant, the Rust compiler forces you to handle it in the analyser and code generator. Bugs become compile errors instead of runtime surprises.
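The benefit is easy to demonstrate with a trimmed-down enum. This is not the course's Expr, just a two-variant stand-in:

```rust
#[derive(Debug)]
enum MiniExpr {
    Int(i64),
    If(Box<MiniExpr>, Box<MiniExpr>, Box<MiniExpr>),
}

// This match is exhaustive. Add a third variant to MiniExpr and this
// function stops compiling until the new case is handled — the compiler
// points at every match that needs updating.
fn depth(e: &MiniExpr) -> usize {
    match e {
        MiniExpr::Int(_) => 1,
        MiniExpr::If(c, t, f) => 1 + depth(c).max(depth(t)).max(depth(f)),
    }
}
```

With Option A's generic lists, the same mistake would surface only at runtime, as a malformed list slipping through.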

The complete Expr enum

Add this to src/ast.rs:

#![allow(unused)]
fn main() {
/// A MiniLisp expression — the core AST node type.
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    /// Integer literal: `42`, `-7`
    Int(i64),

    /// Boolean literal: `#t`, `#f`
    Bool(bool),

    /// String literal: `"hello"`
    Str(String),

    /// Symbol (variable or function name): `x`, `+`, `my-func`
    Symbol(String),

    /// Function call: `(f arg1 arg2 ...)`
    /// The first element is the function; the rest are arguments.
    Call(Box<Expr>, Vec<Expr>),

    /// Variable definition: `(define name value)`
    Define(String, Box<Expr>),

    /// Function definition: `(define (name params...) body...)`
    DefineFunc(String, Vec<String>, Vec<Expr>),

    /// Conditional: `(if condition then else)`
    If(Box<Expr>, Box<Expr>, Box<Expr>),

    /// Anonymous function: `(lambda (params...) body...)`
    Lambda(Vec<String>, Vec<Expr>),

    /// Local bindings: `(let ((name val) ...) body...)`
    Let(Vec<(String, Expr)>, Vec<Expr>),

    /// Sequencing: `(begin expr1 expr2 ...)`
    Begin(Vec<Expr>),
}
}

Design notes:

  • Box<Expr> is used wherever a single sub-expression is required. Without Box, the enum would be infinitely sized because Expr contains Expr.
  • Vec<Expr> is used for variable-length lists of sub-expressions (function arguments, body expressions).
  • Define and DefineFunc are separate variants because they have different structures. (define x 10) binds a name to a value; (define (f x) ...) binds a name to a function with parameters and a body.
  • Call stores the function expression (which could be a symbol, a lambda, or even another call) as a Box<Expr>, and the arguments as a Vec<Expr>.
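The Box requirement is worth seeing concretely. A recursive enum without indirection has no computable size; boxing the recursive field gives the variant a fixed pointer size. Tree here is a stand-in, not the course's Expr:

```rust
enum Tree {
    Leaf(i64),
    // Node(Tree, Tree) would be rejected:
    // error[E0072]: recursive type `Tree` has infinite size
    Node(Box<Tree>, Box<Tree>),
}

fn sum(t: &Tree) -> i64 {
    match t {
        Tree::Leaf(n) => *n,
        Tree::Node(l, r) => sum(l) + sum(r),
    }
}
```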

How MiniLisp maps to the AST

MiniLisp        AST
42              Expr::Int(42)
#t              Expr::Bool(true)
"hi"            Expr::Str("hi".into())
x               Expr::Symbol("x".into())
(+ 1 2)         Expr::Call(Box::new(Expr::Symbol("+".into())), vec![Expr::Int(1), Expr::Int(2)])
(define x 10)   Expr::Define("x".into(), Box::new(Expr::Int(10)))
(if #t 1 0)     Expr::If(Box::new(Expr::Bool(true)), Box::new(Expr::Int(1)), Box::new(Expr::Int(0)))

Implementing Display

A Display implementation lets you print ASTs in a readable Lisp-like format, which is invaluable during debugging:

#![allow(unused)]
fn main() {
use std::fmt;

impl fmt::Display for Expr {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Expr::Int(n) => write!(f, "{}", n),
            Expr::Bool(b) => write!(f, "{}", if *b { "#t" } else { "#f" }),
            Expr::Str(s) => write!(f, "\"{}\"", s),
            Expr::Symbol(s) => write!(f, "{}", s),
            Expr::Call(func, args) => {
                write!(f, "({}", func)?;
                for arg in args {
                    write!(f, " {}", arg)?;
                }
                write!(f, ")")
            }
            Expr::Define(name, val) => write!(f, "(define {} {})", name, val),
            Expr::DefineFunc(name, params, body) => {
                write!(f, "(define ({}", name)?;
                for p in params {
                    write!(f, " {}", p)?;
                }
                write!(f, ")")?;
                for expr in body {
                    write!(f, " {}", expr)?;
                }
                write!(f, ")")
            }
            Expr::If(cond, then, els) => {
                write!(f, "(if {} {} {})", cond, then, els)
            }
            Expr::Lambda(params, body) => {
                write!(f, "(lambda (")?;
                for (i, p) in params.iter().enumerate() {
                    if i > 0 { write!(f, " ")?; }
                    write!(f, "{}", p)?;
                }
                write!(f, ")")?;
                for expr in body {
                    write!(f, " {}", expr)?;
                }
                write!(f, ")")
            }
            Expr::Let(bindings, body) => {
                write!(f, "(let (")?;
                for (i, (name, val)) in bindings.iter().enumerate() {
                    if i > 0 { write!(f, " ")?; }
                    write!(f, "({} {})", name, val)?;
                }
                write!(f, ")")?;
                for expr in body {
                    write!(f, " {}", expr)?;
                }
                write!(f, ")")
            }
            Expr::Begin(exprs) => {
                write!(f, "(begin")?;
                for expr in exprs {
                    write!(f, " {}", expr)?;
                }
                write!(f, ")")
            }
        }
    }
}
}

You can now println!("{}", expr) any AST node and get readable output.


8. Parsing Atoms with nom

With atom parsers and the AST defined, this section assembles them into a single parse_atom function that recognises any MiniLisp atom and returns the corresponding Expr variant. You will use alt to try each alternative in turn, learn how nom reports errors and how to interpret them, and write unit tests that verify correct parsing of every atom type.

Connecting parsers to the AST

In §6 we wrote standalone parsers that returned Rust primitives (i64, bool, String). Now we wrap each one with map to produce Expr variants.

Add these to src/parser.rs:

#![allow(unused)]
fn main() {
use nom::IResult;
use nom::branch::alt;
use nom::bytes::complete::tag;
use nom::character::complete::{char, digit1, none_of, one_of};
use nom::combinator::{map, map_res, opt, recognize};
use nom::multi::{many0, many0_count};
use nom::sequence::{delimited, pair};

use crate::ast::Expr;

/// Parse an integer literal and return `Expr::Int`.
fn parse_int(input: &str) -> IResult<&str, Expr> {
    map(
        map_res(
            recognize(pair(opt(char('-')), digit1)),
            |s: &str| s.parse::<i64>(),
        ),
        Expr::Int,
    )(input)
}

/// Parse a boolean literal and return `Expr::Bool`.
fn parse_bool(input: &str) -> IResult<&str, Expr> {
    alt((
        map(tag("#t"), |_| Expr::Bool(true)),
        map(tag("#f"), |_| Expr::Bool(false)),
    ))(input)
}

/// Parse a string literal and return `Expr::Str`.
fn parse_str(input: &str) -> IResult<&str, Expr> {
    map(
        delimited(
            char('"'),
            map(many0(string_char), |chars| {
                chars.into_iter().collect::<String>()
            }),
            char('"'),
        ),
        Expr::Str,
    )(input)
}

fn string_char(input: &str) -> IResult<&str, char> {
    alt((
        map(tag("\\\\"), |_| '\\'),
        map(tag("\\\""), |_| '"'),
        map(tag("\\n"), |_| '\n'),
        map(tag("\\t"), |_| '\t'),
        none_of("\"\\"),
    ))(input)
}

/// Parse a symbol and return `Expr::Symbol`.
fn parse_symbol(input: &str) -> IResult<&str, Expr> {
    map(
        map(
            recognize(pair(
                one_of(
                    "abcdefghijklmnopqrstuvwxyz\
                     ABCDEFGHIJKLMNOPQRSTUVWXYZ\
                     _+-*/=<>!?",
                ),
                many0_count(one_of(
                    "abcdefghijklmnopqrstuvwxyz\
                     ABCDEFGHIJKLMNOPQRSTUVWXYZ\
                     0123456789_+-*/=<>!?",
                )),
            )),
            String::from,
        ),
        Expr::Symbol,
    )(input)
}
}

Combining atoms with alt

The parse_atom function tries each atom parser in order:

#![allow(unused)]
fn main() {
/// Parse any atom: integer, boolean, string, or symbol.
pub fn parse_atom(input: &str) -> IResult<&str, Expr> {
    alt((parse_bool, parse_str, parse_int, parse_symbol))(input)
}
}

Ordering matters. Booleans are tried first because #t and #f start with #, which no other atom type uses. Strings are next because they start with ". Integers come before symbols because symbols may also begin with - and continue with digits: we want -42 to parse as an integer, not as the symbol -42. If the integer parser fails (no digits after the minus), nom backtracks and the symbol parser gets a chance.

Understanding nom errors

When all alternatives in alt fail, nom returns an error pointing to the position where parsing failed. For example:

#![allow(unused)]
fn main() {
let result = parse_atom("@invalid");
// Err(Err::Error(Error { input: "@invalid", code: ErrorKind::OneOf }))
}

The error says: at position @invalid, the OneOf combinator (inside parse_symbol) could not match any expected character. This is the last alternative that was tried.

During development, you can use nom::error::VerboseError instead of the default error type to get a stack trace of which combinators were attempted. This is helpful for debugging but adds overhead, so switch back to the default for production.

Unit tests

#![allow(unused)]
fn main() {
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn atom_integer() {
        assert_eq!(parse_atom("42"), Ok(("", Expr::Int(42))));
        assert_eq!(parse_atom("-7 rest"), Ok((" rest", Expr::Int(-7))));
    }

    #[test]
    fn atom_boolean() {
        assert_eq!(parse_atom("#t"), Ok(("", Expr::Bool(true))));
        assert_eq!(parse_atom("#f"), Ok(("", Expr::Bool(false))));
    }

    #[test]
    fn atom_string() {
        assert_eq!(
            parse_atom("\"hello\""),
            Ok(("", Expr::Str("hello".to_string())))
        );
    }

    #[test]
    fn atom_symbol() {
        assert_eq!(
            parse_atom("foo"),
            Ok(("", Expr::Symbol("foo".to_string())))
        );
        assert_eq!(
            parse_atom("+"),
            Ok(("", Expr::Symbol("+".to_string())))
        );
    }

    #[test]
    fn minus_is_symbol_not_integer() {
        // A bare "-" with no digits is the symbol "-"
        assert_eq!(
            parse_atom("- rest"),
            Ok((" rest", Expr::Symbol("-".to_string())))
        );
    }
}
}

The last test is important: it verifies that a bare minus sign is parsed as a symbol, not as a failed integer parse. This is the ordering logic from alt at work.


9. Parsing S-Expressions and Special Forms

S-expressions are parenthesised lists: the heart of Lisp syntax. This section extends the parser to handle arbitrarily nested lists, whitespace between elements, and comments. It then lifts special forms — define, if, lambda, let, begin — out of the generic list parser so they become distinct AST variants, and covers how to handle recursive parsers in nom without running into borrow-checker problems.

Whitespace and comments

Before parsing lists, we need a combinator that skips whitespace and comments:

#![allow(unused)]
fn main() {
use nom::character::complete::{multispace0, not_line_ending};

/// Skip whitespace and comments.
fn ws(input: &str) -> IResult<&str, ()> {
    let mut remaining = input;
    loop {
        // Skip whitespace
        let (r, _) = multispace0(remaining)?;
        remaining = r;
        // If we see a comment, skip it
        if remaining.starts_with(';') {
            let (r, _) = not_line_ending(remaining)?;
            remaining = r;
        } else {
            break;
        }
    }
    Ok((remaining, ()))
}
}

This loop handles multiple comments and whitespace in any order.

Parsing a generic list

A list is an opening parenthesis, zero or more whitespace-separated expressions, and a closing parenthesis. The key challenge is that parse_expr (which parses any expression, including lists) must call itself recursively — a list can contain lists.

In nom, recursive parsers require a function boundary. A closure cannot call itself by name, so recursion built from closures quickly runs into trouble; regular fn items can refer to each other freely. Write the mutually recursive parsers as plain functions:

#![allow(unused)]
fn main() {
/// Parse any expression: atom or list.
pub fn parse_expr(input: &str) -> IResult<&str, Expr> {
    let (input, _) = ws(input)?;
    alt((parse_list, parse_atom))(input)
}

/// Parse a parenthesised list of expressions.
fn parse_list(input: &str) -> IResult<&str, Expr> {
    let (input, _) = char('(')(input)?;
    let (input, _) = ws(input)?;

    // Peek at the first symbol to detect special forms
    // We'll try special form parsers first
    let result = alt((
        parse_define,
        parse_if,
        parse_lambda,
        parse_let,
        parse_begin,
        parse_call,
    ))(input);

    let (input, expr) = result?;
    let (input, _) = ws(input)?;
    let (input, _) = char(')')(input)?;
    Ok((input, expr))
}
}

Parsing special forms

Each special form parser matches the keyword and then parses the form’s specific structure. These parsers consume everything between the parentheses — the opening and closing parens are handled by parse_list.

define — two forms:

#![allow(unused)]
fn main() {
use nom::sequence::preceded;

/// Parse the inside of a `define` form.
fn parse_define(input: &str) -> IResult<&str, Expr> {
    let (input, _) = tag("define")(input)?;
    let (input, _) = ws(input)?;

    // Try function definition form: (define (name params...) body...)
    if input.starts_with('(') {
        let (input, _) = char('(')(input)?;
        let (input, _) = ws(input)?;
        let (input, name) = parse_symbol_name(input)?;
        let (input, params) = many0(preceded(ws, parse_symbol_name))(input)?;
        let (input, _) = ws(input)?;
        let (input, _) = char(')')(input)?;
        let (input, body) = many1_expr(input)?;
        return Ok((input, Expr::DefineFunc(name, params, body)));
    }

    // Variable definition form: (define name value)
    let (input, name) = parse_symbol_name(input)?;
    let (input, _) = ws(input)?;
    let (input, value) = parse_expr(input)?;
    Ok((input, Expr::Define(name, Box::new(value))))
}

/// Parse a bare symbol name (returns String, not Expr).
fn parse_symbol_name(input: &str) -> IResult<&str, String> {
    map(
        recognize(pair(
            one_of(
                "abcdefghijklmnopqrstuvwxyz\
                 ABCDEFGHIJKLMNOPQRSTUVWXYZ\
                 _+-*/=<>!?",
            ),
            many0_count(one_of(
                "abcdefghijklmnopqrstuvwxyz\
                 ABCDEFGHIJKLMNOPQRSTUVWXYZ\
                 0123456789_+-*/=<>!?",
            )),
        )),
        String::from,
    )(input)
}

/// Parse one or more expressions (used for body forms).
fn many1_expr(input: &str) -> IResult<&str, Vec<Expr>> {
    let (input, first) = preceded(ws, parse_expr)(input)?;
    let (input, mut rest) = many0(preceded(ws, parse_expr))(input)?;
    rest.insert(0, first);
    Ok((input, rest))
}
}

if — exactly three sub-expressions:

#![allow(unused)]
fn main() {
fn parse_if(input: &str) -> IResult<&str, Expr> {
    let (input, _) = tag("if")(input)?;
    let (input, _) = ws(input)?;
    let (input, cond) = parse_expr(input)?;
    let (input, _) = ws(input)?;
    let (input, then) = parse_expr(input)?;
    let (input, _) = ws(input)?;
    let (input, els) = parse_expr(input)?;
    Ok((
        input,
        Expr::If(Box::new(cond), Box::new(then), Box::new(els)),
    ))
}
}

lambda — parameter list and body:

#![allow(unused)]
fn main() {
fn parse_lambda(input: &str) -> IResult<&str, Expr> {
    let (input, _) = tag("lambda")(input)?;
    let (input, _) = ws(input)?;
    let (input, _) = char('(')(input)?;
    let (input, params) = many0(preceded(ws, parse_symbol_name))(input)?;
    let (input, _) = ws(input)?;
    let (input, _) = char(')')(input)?;
    let (input, body) = many1_expr(input)?;
    Ok((input, Expr::Lambda(params, body)))
}
}

let — binding list and body:

#![allow(unused)]
fn main() {
fn parse_let(input: &str) -> IResult<&str, Expr> {
    let (input, _) = tag("let")(input)?;
    let (input, _) = ws(input)?;
    let (input, _) = char('(')(input)?;
    let (input, bindings) = many0(preceded(ws, parse_binding))(input)?;
    let (input, _) = ws(input)?;
    let (input, _) = char(')')(input)?;
    let (input, body) = many1_expr(input)?;
    Ok((input, Expr::Let(bindings, body)))
}

fn parse_binding(input: &str) -> IResult<&str, (String, Expr)> {
    let (input, _) = char('(')(input)?;
    let (input, _) = ws(input)?;
    let (input, name) = parse_symbol_name(input)?;
    let (input, _) = ws(input)?;
    let (input, val) = parse_expr(input)?;
    let (input, _) = ws(input)?;
    let (input, _) = char(')')(input)?;
    Ok((input, (name, val)))
}
}

begin — one or more expressions:

#![allow(unused)]
fn main() {
fn parse_begin(input: &str) -> IResult<&str, Expr> {
    let (input, _) = tag("begin")(input)?;
    let (input, exprs) = many1_expr(input)?;
    Ok((input, Expr::Begin(exprs)))
}
}

Generic function call — anything else:

#![allow(unused)]
fn main() {
fn parse_call(input: &str) -> IResult<&str, Expr> {
    let (input, func) = parse_expr(input)?;
    let (input, args) = many0(preceded(ws, parse_expr))(input)?;
    Ok((input, Expr::Call(Box::new(func), args)))
}
}

Parsing a program

A program is zero or more top-level expressions:

#![allow(unused)]
fn main() {
/// Parse a complete MiniLisp program.
pub fn parse_program(input: &str) -> IResult<&str, Vec<Expr>> {
    let (input, _) = ws(input)?;
    let (input, exprs) = many0(preceded(ws, parse_expr))(input)?;
    let (input, _) = ws(input)?;
    Ok((input, exprs))
}
}

Handling recursion and the borrow checker

The recursive cycle parse_list → parse_expr → parse_list works because both are plain fn items, not closures. nom processes &str slices, so there are no ownership issues — each parser borrows the input, consumes some of it, and returns the remainder.

If you find yourself wanting closures for recursive parsers, the usual workaround is to wrap the recursive call in a plain helper fn instead. Separately, nom::combinator::cut can improve error messages at grammar boundaries where backtracking is no longer useful.

Tests

#![allow(unused)]
fn main() {
#[test]
fn parse_simple_call() {
    let (_, expr) = parse_expr("(+ 1 2)").unwrap();
    assert_eq!(
        expr,
        Expr::Call(
            Box::new(Expr::Symbol("+".to_string())),
            vec![Expr::Int(1), Expr::Int(2)],
        )
    );
}

#[test]
fn parse_nested() {
    let (_, expr) = parse_expr("(* 3 (+ 1 2))").unwrap();
    match expr {
        Expr::Call(_, args) => assert_eq!(args.len(), 2),
        _ => panic!("expected Call"),
    }
}

#[test]
fn parse_define_var() {
    let (_, expr) = parse_expr("(define x 42)").unwrap();
    assert_eq!(
        expr,
        Expr::Define("x".to_string(), Box::new(Expr::Int(42)))
    );
}

#[test]
fn parse_define_func() {
    let (_, expr) = parse_expr("(define (f x) x)").unwrap();
    match expr {
        Expr::DefineFunc(name, params, _) => {
            assert_eq!(name, "f");
            assert_eq!(params, vec!["x".to_string()]);
        }
        _ => panic!("expected DefineFunc"),
    }
}

#[test]
fn parse_if_expr() {
    let (_, expr) = parse_expr("(if #t 1 0)").unwrap();
    match expr {
        Expr::If(_, _, _) => {} // ok
        _ => panic!("expected If"),
    }
}

#[test]
fn parse_with_comments() {
    let input = "; comment\n(+ 1 2)";
    let (_, expr) = parse_expr(input).unwrap();
    match expr {
        Expr::Call(_, _) => {} // ok
        _ => panic!("expected Call"),
    }
}
}

Part 3 — Semantic Analysis

10. Symbol Tables and Scope

A symbol table maps names to their definitions. This section walks through a scope-aware traversal of the AST that builds a symbol table, resolves every symbol reference to its definition, and reports helpful errors for undefined names or names used outside their scope. You will implement a simple environment chain — the standard technique for representing nested lexical scopes.

What is a symbol table?

A symbol table answers the question: “when the program says x, what does x refer to?” In MiniLisp, x could be a top-level definition, a function parameter, or a local let binding. The symbol table tracks every name that is in scope at every point in the program.

Environment chains

An environment chain is a stack of scopes. Each scope is a set of names. When you look up a name, you search from the innermost scope outward. If no scope contains the name, it is undefined.

#![allow(unused)]
fn main() {
use std::collections::HashSet;

/// A chain of lexical scopes.
pub struct Env {
    scopes: Vec<HashSet<String>>,
}

impl Env {
    /// Create a new environment with a single empty global scope.
    pub fn new() -> Self {
        Env {
            scopes: vec![HashSet::new()],
        }
    }

    /// Push a new scope (entering a function, let, etc.).
    pub fn push_scope(&mut self) {
        self.scopes.push(HashSet::new());
    }

    /// Pop the innermost scope (leaving a function, let, etc.).
    pub fn pop_scope(&mut self) {
        self.scopes.pop();
    }

    /// Define a name in the current (innermost) scope.
    pub fn define(&mut self, name: &str) {
        if let Some(scope) = self.scopes.last_mut() {
            scope.insert(name.to_string());
        }
    }

    /// Check whether a name is defined in any enclosing scope.
    pub fn is_defined(&self, name: &str) -> bool {
        self.scopes.iter().rev().any(|scope| scope.contains(name))
    }
}
}

Why a Vec<HashSet>? Each HashSet is one scope. Pushing and popping is O(1). Lookup walks the vector from back to front, which is O(number of scopes) — for typical programs with a few levels of nesting, this is trivially fast.
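
A quick standalone sketch (stdlib only, reusing the Env type from above) confirms the push/pop behaviour: outer names stay visible inside nested scopes, and inner names vanish once their scope is popped.

```rust
use std::collections::HashSet;

// Minimal copy of the Env type from this section.
struct Env {
    scopes: Vec<HashSet<String>>,
}

impl Env {
    fn new() -> Self {
        Env { scopes: vec![HashSet::new()] }
    }
    fn push_scope(&mut self) { self.scopes.push(HashSet::new()); }
    fn pop_scope(&mut self) { self.scopes.pop(); }
    fn define(&mut self, name: &str) {
        if let Some(scope) = self.scopes.last_mut() {
            scope.insert(name.to_string());
        }
    }
    fn is_defined(&self, name: &str) -> bool {
        self.scopes.iter().rev().any(|s| s.contains(name))
    }
}

fn main() {
    let mut env = Env::new();
    env.define("x");               // global x
    env.push_scope();              // enter a lambda / let
    env.define("y");               // local y
    assert!(env.is_defined("x"));  // outer names stay visible
    assert!(env.is_defined("y"));
    env.pop_scope();               // leave the scope
    assert!(env.is_defined("x"));
    assert!(!env.is_defined("y")); // y did not leak
}
```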

Built-in names

The built-in operators (+, -, *, /, =, <, >, <=, >=, not, and, or, display, newline) are always in scope. Define them in the global scope at startup:

#![allow(unused)]
fn main() {
impl Env {
    /// Create an environment with built-in names pre-defined.
    pub fn with_builtins() -> Self {
        let mut env = Env::new();
        for name in &[
            "+", "-", "*", "/", "=", "<", ">", "<=", ">=",
            "not", "and", "or", "display", "newline",
        ] {
            env.define(name);
        }
        env
    }
}
}

Walking the AST

The analyser traverses the AST recursively, maintaining the environment as it goes. For each node:

  • Symbol — check that the name is defined in the current environment. If not, record an error.
  • Define / DefineFunc — add the name to the current scope, then check the body.
  • Lambda — push a new scope, add parameters, check the body, pop the scope.
  • Let — push a new scope, evaluate each binding’s value in the outer scope (this is let, not let*), add all binding names to the inner scope, check the body, pop the scope.
  • If, Call, Begin — recursively check all sub-expressions.
#![allow(unused)]
fn main() {
use crate::ast::Expr;

/// Check all expressions for scope errors.
/// Returns a list of error messages (empty means success).
pub fn check_scope(exprs: &[Expr]) -> Vec<String> {
    let mut env = Env::with_builtins();
    let mut errors = Vec::new();

    for expr in exprs {
        check_expr(expr, &mut env, &mut errors);
    }

    errors
}

fn check_expr(expr: &Expr, env: &mut Env, errors: &mut Vec<String>) {
    match expr {
        Expr::Int(_) | Expr::Bool(_) | Expr::Str(_) => {
            // Atoms are always valid
        }
        Expr::Symbol(name) => {
            if !env.is_defined(name) {
                errors.push(format!("undefined symbol: {}", name));
            }
        }
        Expr::Call(func, args) => {
            check_expr(func, env, errors);
            for arg in args {
                check_expr(arg, env, errors);
            }
        }
        Expr::Define(name, value) => {
            check_expr(value, env, errors);
            env.define(name);
        }
        Expr::DefineFunc(name, params, body) => {
            env.define(name); // allow recursion
            env.push_scope();
            for p in params {
                env.define(p);
            }
            for expr in body {
                check_expr(expr, env, errors);
            }
            env.pop_scope();
        }
        Expr::If(cond, then, els) => {
            check_expr(cond, env, errors);
            check_expr(then, env, errors);
            check_expr(els, env, errors);
        }
        Expr::Lambda(params, body) => {
            env.push_scope();
            for p in params {
                env.define(p);
            }
            for expr in body {
                check_expr(expr, env, errors);
            }
            env.pop_scope();
        }
        Expr::Let(bindings, body) => {
            // Evaluate binding values in the outer scope
            for (_, val) in bindings {
                check_expr(val, env, errors);
            }
            // Then introduce binding names in a new scope
            env.push_scope();
            for (name, _) in bindings {
                env.define(name);
            }
            for expr in body {
                check_expr(expr, env, errors);
            }
            env.pop_scope();
        }
        Expr::Begin(exprs) => {
            for expr in exprs {
                check_expr(expr, env, errors);
            }
        }
    }
}
}

Error reporting

Notice that DefineFunc defines the function name before checking the body. This allows recursive functions to refer to themselves. Without this, (define (fact n) ... (fact (- n 1))) would report fact as undefined inside its own body.

The analyser collects all errors before stopping, so the programmer sees every problem at once:

undefined symbol: foo
undefined symbol: bar

Tests

#![allow(unused)]
fn main() {
#[test]
fn undefined_symbol() {
    let exprs = vec![Expr::Symbol("x".to_string())];
    let errors = check_scope(&exprs);
    assert_eq!(errors, vec!["undefined symbol: x"]);
}

#[test]
fn defined_symbol() {
    let exprs = vec![
        Expr::Define("x".to_string(), Box::new(Expr::Int(1))),
        Expr::Symbol("x".to_string()),
    ];
    let errors = check_scope(&exprs);
    assert!(errors.is_empty());
}

#[test]
fn builtin_is_defined() {
    let exprs = vec![Expr::Symbol("+".to_string())];
    let errors = check_scope(&exprs);
    assert!(errors.is_empty());
}

#[test]
fn lambda_scope() {
    // (lambda (x) x) — x is defined inside the lambda
    let exprs = vec![Expr::Lambda(
        vec!["x".to_string()],
        vec![Expr::Symbol("x".to_string())],
    )];
    let errors = check_scope(&exprs);
    assert!(errors.is_empty());
}

#[test]
fn lambda_scope_does_not_leak() {
    // (lambda (x) x) then reference x — x should be undefined
    let exprs = vec![
        Expr::Lambda(
            vec!["x".to_string()],
            vec![Expr::Symbol("x".to_string())],
        ),
        Expr::Symbol("x".to_string()),
    ];
    let errors = check_scope(&exprs);
    assert_eq!(errors, vec!["undefined symbol: x"]);
}
}

11. Checking Special Forms

Special forms have fixed shapes: if needs exactly three sub-expressions; define needs a name and a body; lambda needs a parameter list and at least one body expression. This section adds arity and shape checks for each special form so that malformed programs produce clear error messages rather than mysterious C output.

Why check form shapes?

The parser already enforces some structure — it will not produce an If node with two sub-expressions because the parser expects exactly three. However, the parser’s error messages for malformed input can be cryptic (nom reports parsing failures, not semantic problems). The analyser provides a second line of defence with clear, domain-specific error messages.

More importantly, if the AST is ever constructed programmatically (for example, by a macro expander), the analyser catches structural errors that the parser would have prevented.

Checks to implement

Form         Rule                                              Error message
Define       value must not be empty                           define: missing value
DefineFunc   must have at least one body expression            define: function body is empty
If           must have exactly condition, then, else           (enforced by AST structure)
Lambda       must have at least one body expression            lambda: body is empty
Let          must have at least one body expression            let: body is empty
Let          each binding must have exactly a name and value   (enforced by AST structure)
Begin        must have at least one expression                 begin: empty begin block

Several of these are enforced by the AST’s type structure — for example, If always has exactly three Box<Expr> fields, so you cannot construct a two-armed If. The checks below handle the cases that the type system does not catch.

Implementation

Add form validation to src/analysis.rs, integrated with the scope checker:

#![allow(unused)]
fn main() {
/// Validate the shape of special forms.
fn check_forms(expr: &Expr, errors: &mut Vec<String>) {
    match expr {
        Expr::DefineFunc(name, _, body) => {
            if body.is_empty() {
                errors.push(format!(
                    "define: function '{}' body is empty", name
                ));
            }
            for e in body {
                check_forms(e, errors);
            }
        }
        Expr::Lambda(_, body) => {
            if body.is_empty() {
                errors.push("lambda: body is empty".to_string());
            }
            for e in body {
                check_forms(e, errors);
            }
        }
        Expr::Let(bindings, body) => {
            if body.is_empty() {
                errors.push("let: body is empty".to_string());
            }
            for (_, val) in bindings {
                check_forms(val, errors);
            }
            for e in body {
                check_forms(e, errors);
            }
        }
        Expr::Begin(exprs) => {
            if exprs.is_empty() {
                errors.push("begin: empty begin block".to_string());
            }
            for e in exprs {
                check_forms(e, errors);
            }
        }
        Expr::Define(_, val) => {
            check_forms(val, errors);
        }
        Expr::If(cond, then, els) => {
            check_forms(cond, errors);
            check_forms(then, errors);
            check_forms(els, errors);
        }
        Expr::Call(func, args) => {
            check_forms(func, errors);
            for a in args {
                check_forms(a, errors);
            }
        }
        Expr::Int(_) | Expr::Bool(_) | Expr::Str(_) | Expr::Symbol(_) => {}
    }
}
}

Combining scope and form checks

The public analyse function runs both passes:

#![allow(unused)]
fn main() {
/// Run all semantic checks on a parsed program.
pub fn analyse(exprs: &[Expr]) -> Result<(), Vec<String>> {
    let mut errors = Vec::new();

    // Pass 1: scope checking
    errors.extend(check_scope(exprs));

    // Pass 2: form validation
    for expr in exprs {
        check_forms(expr, &mut errors);
    }

    if errors.is_empty() {
        Ok(())
    } else {
        Err(errors)
    }
}
}

Tests

#![allow(unused)]
fn main() {
#[test]
fn empty_begin() {
    let exprs = vec![Expr::Begin(vec![])];
    let result = analyse(&exprs);
    assert!(result.is_err());
    let errs = result.unwrap_err();
    assert!(errs.iter().any(|e| e.contains("empty begin block")));
}

#[test]
fn empty_lambda_body() {
    let exprs = vec![Expr::Lambda(vec!["x".to_string()], vec![])];
    let result = analyse(&exprs);
    assert!(result.is_err());
    let errs = result.unwrap_err();
    assert!(errs.iter().any(|e| e.contains("lambda: body is empty")));
}

#[test]
fn valid_program() {
    let exprs = vec![
        Expr::Define("x".to_string(), Box::new(Expr::Int(10))),
        Expr::Call(
            Box::new(Expr::Symbol("display".to_string())),
            vec![Expr::Symbol("x".to_string())],
        ),
    ];
    assert!(analyse(&exprs).is_ok());
}
}

Part 4 — Code Generation

12. The C Runtime Preamble

Every MiniLisp program compiles to a C file that begins with a standard preamble: #include directives, type aliases, boolean constants, and thin wrappers for built-in operations like display and newline. This section designs the preamble, explains why each piece is there, and shows how the code generator emits it before any user-defined code.

Why a preamble?

MiniLisp has built-in operations — display, newline, arithmetic, comparisons — that the generated C code calls. These need C implementations. Rather than linking a separate runtime library, we emit the implementations directly at the top of the C file. This makes every generated file self-contained: you compile it with cc -o out out.c and it works.

The complete preamble

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Boolean type */
typedef int bool_t;
#define TRUE 1
#define FALSE 0

/* Display functions */
static void display_int(long long x) {
    printf("%lld", x);
}

static void display_bool(bool_t x) {
    printf("%s", x ? "#t" : "#f");
}

static void display_str(const char *s) {
    printf("%s", s);
}

static void newline_() {
    printf("\n");
}

Line by line:

  • #include <stdio.h> — for printf.
  • #include <stdlib.h> — for exit, used by error-handling code.
  • #include <string.h> — for any string operations.
  • typedef int bool_t — MiniLisp booleans become C integers. TRUE and FALSE are macros for readability.
  • display_int, display_bool, display_str — MiniLisp’s display is polymorphic (it can print any value). Since our compiler knows the type of each expression at compile time (integers, booleans, and strings are the only printable types), it calls the correct display_* variant directly.
  • newline_ — the trailing underscore avoids clashing with C’s newline if one exists in the environment. This is a pattern we use throughout: MiniLisp names are mangled to avoid C keyword and standard library conflicts.

Emitting the preamble

In src/codegen.rs, define the preamble as a constant:

#![allow(unused)]
fn main() {
// Note: r##"…"## delimiters are required here. The body contains "#t",
// and the sequence `"#` inside it would terminate a plain r#"…"# raw
// string early.
const PREAMBLE: &str = r##"#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef int bool_t;
#define TRUE 1
#define FALSE 0

static void display_int(long long x) {
    printf("%lld", x);
}

static void display_bool(bool_t x) {
    printf("%s", x ? "#t" : "#f");
}

static void display_str(const char *s) {
    printf("%s", s);
}

static void newline_() {
    printf("\n");
}

"##;
}

The code generator begins every output file with this string. User-defined code follows immediately after.

Design decision: monomorphic display

A more sophisticated compiler would use a tagged union (a “value” type) so that display could handle any value at runtime. We avoid this because:

  1. It adds significant complexity (you need a value representation, tag checks, and memory management).
  2. Our compiler knows at AST level whether an expression is an integer, boolean, or string, so it can emit the correct display_* call directly.
  3. The goal is simplicity and clarity — you can extend this later if you want dynamic typing.

13. Generating C: Atoms and Expressions

This section implements the expression code generator — the recursive function that turns an Expr into a C expression string. Integers become C integer literals; booleans become TRUE and FALSE; strings become string literals; arithmetic and comparison operations become C operators; function calls become C function-call syntax. You will also handle name-mangling: turning Lisp symbols like my-var into valid C identifiers.

Name mangling

MiniLisp symbols can contain characters that are not valid in C identifiers: -, +, *, /, =, <, >, !, ?. We replace each with a safe suffix:

#![allow(unused)]
fn main() {
/// Convert a MiniLisp symbol into a valid C identifier.
fn mangle(name: &str) -> String {
    let mut out = String::new();
    for ch in name.chars() {
        match ch {
            '-' => out.push_str("_dash_"),
            '+' => out.push_str("_plus_"),
            '*' => out.push_str("_star_"),
            '/' => out.push_str("_slash_"),
            '=' => out.push_str("_eq_"),
            '<' => out.push_str("_lt_"),
            '>' => out.push_str("_gt_"),
            '!' => out.push_str("_bang_"),
            '?' => out.push_str("_qmark_"),
            c => out.push(c),
        }
    }
    out
}
}

Examples:

MiniLisp    C identifier
x           x
my-var      my_dash_var
<=          _lt__eq_
zero?       zero_qmark_

Generating atoms

#![allow(unused)]
fn main() {
/// Generate a C expression from an AST node.
fn gen_expr(expr: &Expr) -> String {
    match expr {
        Expr::Int(n) => format!("{}LL", n),
        Expr::Bool(b) => if *b { "TRUE".to_string() } else { "FALSE".to_string() },
        Expr::Str(s) => format!("\"{}\"", s.replace('\\', "\\\\")
                                            .replace('"', "\\\"")
                                            .replace('\n', "\\n")
                                            .replace('\t', "\\t")),
        Expr::Symbol(name) => mangle(name),
        // ... other variants handled below
        _ => todo!(),
    }
}
}

Details:

  • Integers get the LL suffix to ensure they are long long in C, matching our display_int signature.
  • Booleans become the macros TRUE and FALSE from the preamble.
  • Strings are re-escaped for C. The MiniLisp parser already resolved escape sequences into actual characters, so we must re-escape them for the C source.
  • Symbols are mangled into valid C identifiers.
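
Atom generation can be exercised standalone. The sketch below uses a trimmed-down Expr (atom variants only) and a helper name of my own, gen_atom, rather than the chapter's full gen_expr; it shows the re-escaping round trip for strings.

```rust
// Trimmed-down Expr with only the atom variants from this section.
enum Expr {
    Int(i64),
    Bool(bool),
    Str(String),
}

// Hypothetical helper: the atom arms of gen_expr, extracted for testing.
fn gen_atom(expr: &Expr) -> String {
    match expr {
        Expr::Int(n) => format!("{}LL", n),
        Expr::Bool(b) => if *b { "TRUE".to_string() } else { "FALSE".to_string() },
        // The parser already turned escape sequences into real characters,
        // so the C literal must escape them again.
        Expr::Str(s) => format!("\"{}\"", s.replace('\\', "\\\\")
                                            .replace('"', "\\\"")
                                            .replace('\n', "\\n")
                                            .replace('\t', "\\t")),
    }
}

fn main() {
    assert_eq!(gen_atom(&Expr::Int(42)), "42LL");
    assert_eq!(gen_atom(&Expr::Bool(false)), "FALSE");
    // A string holding a real newline becomes the two-character
    // escape sequence \n in the emitted C source.
    assert_eq!(gen_atom(&Expr::Str("a\nb".to_string())), "\"a\\nb\"");
}
```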

Generating arithmetic and comparisons

Built-in binary operators compile to infix C operators:

#![allow(unused)]
fn main() {
Expr::Call(func, args) => {
    if let Expr::Symbol(name) = func.as_ref() {
        match name.as_str() {
            // Arithmetic
            "+" => format!("({} + {})", gen_expr(&args[0]), gen_expr(&args[1])),
            "-" => format!("({} - {})", gen_expr(&args[0]), gen_expr(&args[1])),
            "*" => format!("({} * {})", gen_expr(&args[0]), gen_expr(&args[1])),
            "/" => format!("({} / {})", gen_expr(&args[0]), gen_expr(&args[1])),
            // Comparisons
            "=" => format!("({} == {})", gen_expr(&args[0]), gen_expr(&args[1])),
            "<" => format!("({} < {})", gen_expr(&args[0]), gen_expr(&args[1])),
            ">" => format!("({} > {})", gen_expr(&args[0]), gen_expr(&args[1])),
            "<=" => format!("({} <= {})", gen_expr(&args[0]), gen_expr(&args[1])),
            ">=" => format!("({} >= {})", gen_expr(&args[0]), gen_expr(&args[1])),
            // Boolean operators
            "not" => format!("(!{})", gen_expr(&args[0])),
            "and" => format!("({} && {})", gen_expr(&args[0]), gen_expr(&args[1])),
            "or" => format!("({} || {})", gen_expr(&args[0]), gen_expr(&args[1])),
            // Display
            "display" => gen_display(&args[0]),
            "newline" => "newline_()".to_string(),
            // User-defined function call
            _ => {
                let mangled = mangle(name);
                let arg_strs: Vec<String> = args.iter().map(gen_expr).collect();
                format!("{}({})", mangled, arg_strs.join(", "))
            }
        }
    } else {
        // Calling a non-symbol expression (e.g., a lambda)
        let func_str = gen_expr(func);
        let arg_strs: Vec<String> = args.iter().map(gen_expr).collect();
        format!("{}({})", func_str, arg_strs.join(", "))
    }
}
}

Each arithmetic and comparison operation is wrapped in parentheses to preserve correct precedence in the C output.
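
The effect is easy to verify end to end. A standalone sketch with a trimmed Expr and just the + and * arms (mangling elided) shows that the inner parentheses survive, so C precedence can never reassociate the expression:

```rust
// Minimal Expr subset for demonstrating operator generation.
enum Expr {
    Int(i64),
    Symbol(String),
    Call(Box<Expr>, Vec<Expr>),
}

fn gen_expr(expr: &Expr) -> String {
    match expr {
        Expr::Int(n) => format!("{}LL", n),
        Expr::Symbol(s) => s.clone(), // mangling elided for brevity
        Expr::Call(func, args) => {
            if let Expr::Symbol(name) = func.as_ref() {
                match name.as_str() {
                    "+" => return format!("({} + {})", gen_expr(&args[0]), gen_expr(&args[1])),
                    "*" => return format!("({} * {})", gen_expr(&args[0]), gen_expr(&args[1])),
                    _ => {}
                }
            }
            // Fallback: ordinary call syntax.
            let rendered: Vec<String> = args.iter().map(gen_expr).collect();
            format!("{}({})", gen_expr(func), rendered.join(", "))
        }
    }
}

fn main() {
    // (* 3 (+ 1 2))
    let inner = Expr::Call(
        Box::new(Expr::Symbol("+".to_string())),
        vec![Expr::Int(1), Expr::Int(2)],
    );
    let outer = Expr::Call(
        Box::new(Expr::Symbol("*".to_string())),
        vec![Expr::Int(3), inner],
    );
    assert_eq!(gen_expr(&outer), "(3LL * (1LL + 2LL))");
}
```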

Generating display calls

display in MiniLisp is polymorphic, but our compiler knows the type statically. We dispatch based on the argument’s AST type:

#![allow(unused)]
fn main() {
fn gen_display(expr: &Expr) -> String {
    match expr {
        Expr::Int(_) => format!("display_int({})", gen_expr(expr)),
        Expr::Bool(_) => format!("display_bool({})", gen_expr(expr)),
        Expr::Str(_) => format!("display_str({})", gen_expr(expr)),
        // For expressions whose type we can't determine statically,
        // default to display_int (integers are the most common case)
        _ => format!("display_int({})", gen_expr(expr)),
    }
}
}

This is a simplification. A production compiler would either track types through the AST or use a tagged union. For our purposes, defaulting to display_int for computed expressions (like (display (+ 1 2))) is correct because all our arithmetic produces integers.

Tests

#![allow(unused)]
fn main() {
#[test]
fn gen_integer() {
    assert_eq!(gen_expr(&Expr::Int(42)), "42LL");
}

#[test]
fn gen_bool() {
    assert_eq!(gen_expr(&Expr::Bool(true)), "TRUE");
}

#[test]
fn gen_addition() {
    let expr = Expr::Call(
        Box::new(Expr::Symbol("+".to_string())),
        vec![Expr::Int(1), Expr::Int(2)],
    );
    assert_eq!(gen_expr(&expr), "(1LL + 2LL)");
}

#[test]
fn mangle_hyphen() {
    assert_eq!(mangle("my-var"), "my_dash_var");
}
}

14. Generating C: Definitions and Functions

Top-level define forms and lambda expressions compile to C function and variable declarations. This section covers how to emit forward declarations (so mutual recursion works), how to turn a MiniLisp parameter list into a C function signature, how lambda compiles to a named C function, and how top-level definitions are ordered in the output file.

The output file structure

The generated C file has four sections, in this order:

/* 1. Preamble (§12) */
#include <stdio.h>
// ...

/* 2. Forward declarations */
long long square(long long x);
long long fact(long long n);

/* 3. Function definitions */
long long square(long long x) {
    return (x * x);
}

long long fact(long long n) {
    return (n == 0LL) ? 1LL : (n * fact((n - 1LL)));
}

/* 4. main() — top-level expressions */
int main() {
    display_int(square(5LL));
    newline_();
    return 0;
}

Forward declarations allow functions to call each other regardless of definition order. This mirrors how MiniLisp works — order of definitions does not matter for function calls.

Generating function definitions

A DefineFunc compiles to a C function. All parameters and return values use long long for simplicity (MiniLisp does not have a type system, and integers are the primary data type):

#![allow(unused)]
fn main() {
fn gen_func_decl(name: &str, params: &[String]) -> String {
    let mangled_name = mangle(name);
    let param_list: Vec<String> = params
        .iter()
        .map(|p| format!("long long {}", mangle(p)))
        .collect();
    format!("long long {}({})", mangled_name, param_list.join(", "))
}

fn gen_func_def(name: &str, params: &[String], body: &[Expr]) -> String {
    let decl = gen_func_decl(name, params);
    let body_strs: Vec<String> = body.iter().map(gen_expr).collect();

    // All body expressions except the last are statements;
    // the last is the return value
    let mut lines = Vec::new();
    for (i, s) in body_strs.iter().enumerate() {
        if i == body_strs.len() - 1 {
            lines.push(format!("    return {};", s));
        } else {
            lines.push(format!("    {};", s));
        }
    }

    format!("{} {{\n{}\n}}", decl, lines.join("\n"))
}
}

For example, (define (square x) (* x x)) becomes:

long long square(long long x) {
    return (x * x);
}
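
The signature builder can be checked on its own. This sketch pairs gen_func_decl with an abbreviated mangler (only the two replacements the assertions need, not the full table from §13):

```rust
// Abbreviated mangler: just the replacements these examples need.
fn mangle(name: &str) -> String {
    name.replace('-', "_dash_").replace('?', "_qmark_")
}

fn gen_func_decl(name: &str, params: &[String]) -> String {
    let param_list: Vec<String> = params
        .iter()
        .map(|p| format!("long long {}", mangle(p)))
        .collect();
    format!("long long {}({})", mangle(name), param_list.join(", "))
}

fn main() {
    assert_eq!(
        gen_func_decl("square", &["x".to_string()]),
        "long long square(long long x)"
    );
    // Lisp-only characters are mangled inside signatures too.
    assert_eq!(
        gen_func_decl("zero?", &["n".to_string()]),
        "long long zero_qmark_(long long n)"
    );
}
```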

Generating variable definitions

A Define (variable, not function) compiles to a global variable:

#![allow(unused)]
fn main() {
fn gen_var_def(name: &str, value: &Expr) -> String {
    format!("long long {} = {};", mangle(name), gen_expr(value))
}
}

For example, (define pi 3) becomes:

long long pi = 3LL;

Lambda lifting

A lambda expression needs a name in C — C does not have anonymous functions. We generate a unique name for each lambda:

#![allow(unused)]
fn main() {
use std::sync::atomic::{AtomicUsize, Ordering};

static LAMBDA_COUNTER: AtomicUsize = AtomicUsize::new(0);

fn fresh_lambda_name() -> String {
    let n = LAMBDA_COUNTER.fetch_add(1, Ordering::SeqCst);
    format!("__lambda_{}", n)
}
}
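
A standalone check of the counter: fetch_add returns the previous value, so the first name is __lambda_0, and because the counter is a process-wide atomic, names stay unique even if code generation is ever parallelised.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

static LAMBDA_COUNTER: AtomicUsize = AtomicUsize::new(0);

fn fresh_lambda_name() -> String {
    // fetch_add returns the value *before* the increment.
    let n = LAMBDA_COUNTER.fetch_add(1, Ordering::SeqCst);
    format!("__lambda_{}", n)
}

fn main() {
    assert_eq!(fresh_lambda_name(), "__lambda_0");
    assert_eq!(fresh_lambda_name(), "__lambda_1");
    assert_ne!(fresh_lambda_name(), fresh_lambda_name());
}
```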

When the code generator encounters a Lambda, it:

  1. Generates a named function definition using the fresh name.
  2. Adds it to a list of “lifted” functions that will be emitted before main.
  3. Returns the function’s name as the C expression.
#![allow(unused)]
fn main() {
fn gen_lambda(
    params: &[String],
    body: &[Expr],
    lifted: &mut Vec<String>,
) -> String {
    let name = fresh_lambda_name();
    let def = gen_func_def(&name, params, body);
    lifted.push(def);
    name
}
}

Putting function generation together

The top-level generation function separates definitions from expressions:

#![allow(unused)]
fn main() {
pub fn generate(exprs: &[Expr]) -> String {
    let mut forward_decls = Vec::new();
    let mut func_defs = Vec::new();
    let mut main_stmts = Vec::new();
    // Note: gen_expr as sketched in §13 does not thread `lifted` through.
    // To actually lift lambdas, pass `&mut lifted` into expression
    // generation (or collect them via a shared structure).
    let mut lifted = Vec::new();

    for expr in exprs {
        match expr {
            Expr::DefineFunc(name, params, body) => {
                forward_decls.push(format!("{};", gen_func_decl(name, params)));
                func_defs.push(gen_func_def(name, params, body));
            }
            Expr::Define(name, value) => {
                // Emitted as a local inside main for simplicity; to match
                // gen_var_def's global form shown earlier, collect these
                // into a globals section before main instead.
                main_stmts.push(format!(
                    "long long {} = {};",
                    mangle(name),
                    gen_expr(value)
                ));
            }
            _ => {
                main_stmts.push(format!("{};", gen_expr(expr)));
            }
        }
    }

    let mut output = String::new();
    output.push_str(PREAMBLE);

    for decl in &forward_decls {
        output.push_str(decl);
        output.push('\n');
    }
    output.push('\n');

    for def in &func_defs {
        output.push_str(def);
        output.push_str("\n\n");
    }

    for lf in &lifted {
        output.push_str(lf);
        output.push_str("\n\n");
    }

    output.push_str("int main() {\n");
    for stmt in &main_stmts {
        output.push_str("    ");
        output.push_str(stmt);
        output.push('\n');
    }
    output.push_str("    return 0;\n");
    output.push_str("}\n");

    output
}
}

15. Generating C: Control Flow and Sequencing

if, begin, and let each require their own code-generation strategy. if becomes a C ternary expression or an if/else statement depending on context; begin becomes a sequence of C statements with the last value forwarded; let introduces a C block with local variable declarations. This section works through each form and resolves the practical question of when to emit expressions versus statements.

The expression vs. statement problem

C distinguishes expressions (produce a value) from statements (do not). MiniLisp does not — everything is an expression. This creates a tension: sometimes a MiniLisp form appears where a C expression is needed (e.g., as a function argument), and sometimes it appears where a C statement is fine (e.g., at the top level).

Our strategy: generate expressions wherever possible, falling back to statement blocks only when necessary. The ternary operator ? : lets us keep if as an expression. begin and let use GCC’s “statement expressions” extension ({ ... }) to remain in expression position.

Generating if

An if compiles to a C ternary expression:

#![allow(unused)]
fn main() {
Expr::If(cond, then, els) => {
    format!(
        "({} ? {} : {})",
        gen_expr(cond),
        gen_expr(then),
        gen_expr(els),
    )
}
}

MiniLisp (if (> x 0) x (- 0 x)) becomes C ((x > 0LL) ? x : (0LL - x)).

The ternary operator works because our if always has both branches. If we supported if without an else branch, we would need a different strategy.

Generating begin

begin evaluates a sequence of expressions and returns the value of the last one. We use GCC’s statement-expression extension:

#![allow(unused)]
fn main() {
Expr::Begin(exprs) => {
    // Every expression keeps its trailing semicolon: GCC's ({ ... })
    // statement expression takes its value from the final `expr;`,
    // and omitting that semicolon is a syntax error.
    let parts: Vec<String> = exprs
        .iter()
        .map(|e| format!("{};", gen_expr(e)))
        .collect();
    format!("({{ {} }})", parts.join(" "))
}
}

MiniLisp:

(begin
  (display 1)
  (newline)
  42)

Becomes C:

({ display_int(1LL); newline_(); 42LL; })

The statement-expression ({ ... }) evaluates each statement in order and produces the value of the last expression. This is a GCC/Clang extension, not standard C. For portability, you could instead generate a helper function, but the extension keeps things simple.
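For begin specifically, standard C offers a portable alternative: the comma operator evaluates its operands left to right and yields the last one. A sketch of that route (gen_begin_portable takes already-generated C fragments and does not appear in the compiler above):

```rust
// Hedged sketch: compile a begin body to a standard-C comma
// expression instead of a GCC statement expression. Arguments
// are the already-generated C fragments for each sub-expression.
fn gen_begin_portable(parts: &[&str]) -> String {
    format!("({})", parts.join(", "))
}

fn main() {
    let c = gen_begin_portable(&["display_int(1LL)", "newline_()", "42LL"]);
    assert_eq!(c, "(display_int(1LL), newline_(), 42LL)");
}
```

This does not extend to let, whose variable declarations still need a block, which is why the statement-expression route covers both forms uniformly.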

Generating let

let introduces local variables. We generate a statement-expression with local declarations:

#![allow(unused)]
fn main() {
Expr::Let(bindings, body) => {
    let mut parts = Vec::new();
    for (name, val) in bindings {
        parts.push(format!(
            "long long {} = {};",
            mangle(name),
            gen_expr(val),
        ));
    }
    for e in body {
        // As with begin, every body expression ends in a semicolon;
        // the last one supplies the statement expression's value.
        parts.push(format!("{};", gen_expr(e)));
    }
    format!("({{ {} }})", parts.join(" "))
}
}

MiniLisp:

(let ((x 10)
      (y 20))
  (+ x y))

Becomes C:

({ long long x = 10LL; long long y = 20LL; (x + y); })

Complete gen_expr

Here is the full gen_expr function with all variants:

#![allow(unused)]
fn main() {
fn gen_expr(expr: &Expr) -> String {
    match expr {
        Expr::Int(n) => format!("{}LL", n),
        Expr::Bool(b) => if *b { "TRUE".to_string() } else { "FALSE".to_string() },
        Expr::Str(s) => {
            let escaped = s.replace('\\', "\\\\")
                           .replace('"', "\\\"")
                           .replace('\n', "\\n")
                           .replace('\t', "\\t");
            format!("\"{}\"", escaped)
        }
        Expr::Symbol(name) => mangle(name),
        Expr::Call(func, args) => {
            // ... (as shown in §13)
            gen_call(func, args)
        }
        Expr::If(cond, then, els) => {
            format!("({} ? {} : {})",
                gen_expr(cond), gen_expr(then), gen_expr(els))
        }
        Expr::Begin(exprs) => {
            let mut parts = Vec::new();
            for e in exprs {
                parts.push(format!("{};", gen_expr(e)));
            }
            format!("({{ {} }})", parts.join(" "))
        }
        Expr::Let(bindings, body) => {
            let mut parts = Vec::new();
            for (name, val) in bindings {
                parts.push(format!("long long {} = {};",
                    mangle(name), gen_expr(val)));
            }
            for e in body {
                parts.push(format!("{};", gen_expr(e)));
            }
            format!("({{ {} }})", parts.join(" "))
        }
        Expr::Lambda(params, body) => {
            // Lambda lifting is handled at the top level. A lambda
            // in expression position is not yet supported, so we
            // emit a placeholder comment; the C compiler will then
            // reject the output, flagging the unsupported form.
            let name = fresh_lambda_name();
            format!("/* lambda {} */", name)
        }
        Expr::Define(name, val) => {
            format!("long long {} = {}", mangle(name), gen_expr(val))
        }
        Expr::DefineFunc(_, _, _) => {
            // Function definitions are handled at the top level
            String::new()
        }
    }
}
}
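One subtlety in the Str arm: the replacement order matters. Backslashes must be doubled first, or the backslashes introduced by the later replacements would themselves get doubled. A self-contained copy of the escaping logic, checked in isolation:

```rust
// The string-escaping rules from gen_expr's Str arm, extracted so
// they can be checked on their own. Backslashes are escaped first.
fn escape_c(s: &str) -> String {
    let escaped = s.replace('\\', "\\\\")
                   .replace('"', "\\\"")
                   .replace('\n', "\\n")
                   .replace('\t', "\\t");
    format!("\"{}\"", escaped)
}

fn main() {
    // say "hi" followed by a newline becomes a valid C string literal
    assert_eq!(escape_c("say \"hi\"\n"), "\"say \\\"hi\\\"\\n\"");
    assert_eq!(escape_c("a\\b"), "\"a\\\\b\"");
}
```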

Tests

#![allow(unused)]
fn main() {
#[test]
fn gen_if() {
    let expr = Expr::If(
        Box::new(Expr::Bool(true)),
        Box::new(Expr::Int(1)),
        Box::new(Expr::Int(0)),
    );
    assert_eq!(gen_expr(&expr), "(TRUE ? 1LL : 0LL)");
}

#[test]
fn gen_let() {
    let expr = Expr::Let(
        vec![("x".to_string(), Expr::Int(10))],
        vec![Expr::Symbol("x".to_string())],
    );
    assert_eq!(gen_expr(&expr), "({ long long x = 10LL; x; })");
}

#[test]
fn gen_begin() {
    let expr = Expr::Begin(vec![Expr::Int(1), Expr::Int(2)]);
    assert_eq!(gen_expr(&expr), "({ 1LL; 2LL; })");
}
}

Part 5 — Putting It Together

16. The Compilation Pipeline

With all stages implemented, this section wires them into a single compile function and builds a CLI entry point that reads MiniLisp from a file or stdin and writes C to stdout or a file. You will add basic error reporting that shows the source location of each failure and trace a complete example — a recursive factorial function — through every stage.

The compile function

#![allow(unused)]
fn main() {
// In src/main.rs

mod ast;
mod parser;
mod analysis;
mod codegen;

use std::io::{self, Read};
use std::process;

fn compile(source: &str) -> Result<String, Vec<String>> {
    // Stage 1: Parse
    let exprs = match parser::parse_program(source) {
        Ok(("", exprs)) => exprs,
        Ok((remaining, _)) => {
            return Err(vec![format!(
                "parse error: unexpected trailing input: {}",
                // take chars, not bytes, so we never slice mid-UTF-8
                remaining.chars().take(40).collect::<String>()
            )]);
        }
        Err(e) => {
            return Err(vec![format!("parse error: {}", e)]);
        }
    };

    // Stage 2: Analyse
    analysis::analyse(&exprs)?;

    // Stage 3: Generate
    Ok(codegen::generate(&exprs))
}
}

The function takes MiniLisp source as a string and returns either the generated C code or a list of error messages.

The CLI entry point

fn main() {
    let mut input = String::new();
    io::stdin()
        .read_to_string(&mut input)
        .expect("failed to read stdin");

    match compile(&input) {
        Ok(c_code) => print!("{}", c_code),
        Err(errors) => {
            for err in &errors {
                eprintln!("error: {}", err);
            }
            process::exit(1);
        }
    }
}

Usage:

# Compile MiniLisp to C
echo '(display 42) (newline)' | cargo run > out.c

# Compile and run the C
cc -o out out.c && ./out
42

Tracing an example: factorial

Let us trace (define (fact n) (if (= n 0) 1 (* n (fact (- n 1))))) (display (fact 10)) (newline) through every stage.

Stage 1: Parse

The parser produces:

#![allow(unused)]
fn main() {
[
    DefineFunc("fact", ["n"], [
        If(
            Call(Symbol("="), [Symbol("n"), Int(0)]),
            Int(1),
            Call(Symbol("*"), [
                Symbol("n"),
                Call(Symbol("fact"), [
                    Call(Symbol("-"), [Symbol("n"), Int(1)])
                ])
            ])
        )
    ]),
    Call(Symbol("display"), [
        Call(Symbol("fact"), [Int(10)])
    ]),
    Call(Symbol("newline"), []),
]
}

Stage 2: Analyse

The analyser checks:

  • fact is defined (by DefineFunc), so the recursive call is valid.
  • n is a parameter of fact, so it is in scope inside the body.
  • =, *, -, display, newline are built-in — always in scope.
  • The if has three sub-expressions. All forms are well-shaped.

Result: no errors.

Stage 3: Generate

The code generator produces:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef int bool_t;
#define TRUE 1
#define FALSE 0

// ... (display functions, etc.)

long long fact(long long n);

long long fact(long long n) {
    return ((n == 0LL) ? 1LL : (n * fact((n - 1LL))));
}

int main() {
    display_int(fact(10LL));
    newline_();
    return 0;
}

Compile and run:

$ cc -o fact out.c && ./fact
3628800

Error reporting

When the compiler encounters errors, it prints them to stderr and exits with code 1:

$ echo '(display x)' | cargo run
error: undefined symbol: x
$ echo '(if #t 1)' | cargo run
error: parse error: ...

The parser error will mention the position where parsing failed. The semantic analyser’s errors identify the problematic symbol or form by name.
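The Vec&lt;String&gt; error type exists so the analyser can keep going after the first problem and report everything at once. A minimal, self-contained sketch of that accumulation pattern (check_symbols is a stand-in for the real AST walk, not part of the compiler above):

```rust
// Sketch of the accumulate-then-fail pattern behind analyse.
// `used` stands in for symbols the program references,
// `defined` for symbols in scope.
fn check_symbols(used: &[&str], defined: &[&str]) -> Result<(), Vec<String>> {
    let mut errors = Vec::new();
    for sym in used {
        if !defined.contains(sym) {
            errors.push(format!("undefined symbol: {}", sym));
        }
    }
    if errors.is_empty() { Ok(()) } else { Err(errors) }
}

fn main() {
    // Both undefined symbols are reported in a single pass.
    let errs = check_symbols(&["x", "+", "y"], &["+"]).unwrap_err();
    assert_eq!(errs, vec!["undefined symbol: x", "undefined symbol: y"]);
}
```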


17. Testing the Compiler

Good tests are what turn a working prototype into a reliable tool. This section adds unit tests for each compiler stage and integration tests that compile MiniLisp programs, feed the C output to cc, run the binary, and assert on stdout. You will build a small test corpus of MiniLisp programs covering all language features and ensure the compiler handles both valid and invalid input gracefully.

Unit test structure

Each module has inline unit tests in a #[cfg(test)] module. We have already seen examples throughout the course. Here is a summary of what to test in each module:

ast.rs — test Display:

#![allow(unused)]
fn main() {
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn display_int() {
        assert_eq!(format!("{}", Expr::Int(42)), "42");
    }

    #[test]
    fn display_call() {
        let expr = Expr::Call(
            Box::new(Expr::Symbol("+".to_string())),
            vec![Expr::Int(1), Expr::Int(2)],
        );
        assert_eq!(format!("{}", expr), "(+ 1 2)");
    }
}
}

parser.rs — test parsing of every construct:

#![allow(unused)]
fn main() {
#[test]
fn parse_factorial() {
    let input = "(define (fact n) (if (= n 0) 1 (* n (fact (- n 1)))))";
    let (remaining, expr) = parse_expr(input).unwrap();
    assert_eq!(remaining, "");
    match expr {
        Expr::DefineFunc(name, params, body) => {
            assert_eq!(name, "fact");
            assert_eq!(params, vec!["n".to_string()]);
            assert_eq!(body.len(), 1); // single if expression
        }
        _ => panic!("expected DefineFunc"),
    }
}
}

analysis.rs — test valid and invalid programs:

#![allow(unused)]
fn main() {
#[test]
fn recursive_function_is_valid() {
    let exprs = vec![Expr::DefineFunc(
        "f".to_string(),
        vec!["x".to_string()],
        vec![Expr::Call(
            Box::new(Expr::Symbol("f".to_string())),
            vec![Expr::Symbol("x".to_string())],
        )],
    )];
    assert!(analyse(&exprs).is_ok());
}
}

codegen.rs — test C output for known inputs:

#![allow(unused)]
fn main() {
#[test]
fn gen_full_program() {
    let exprs = vec![
        Expr::Call(
            Box::new(Expr::Symbol("display".to_string())),
            vec![Expr::Int(42)],
        ),
    ];
    let output = generate(&exprs);
    assert!(output.contains("display_int(42LL)"));
    assert!(output.contains("int main()"));
}
}

Integration tests

Integration tests live in the tests/ directory and use the tempfile crate (add it under [dev-dependencies] in Cargo.toml) for a temporary build directory. Each test compiles a MiniLisp program, runs the C compiler, executes the binary, and asserts on the output:

#![allow(unused)]
fn main() {
// tests/integration.rs

use std::process::Command;
use std::io::Write;

fn compile_and_run(minilisp: &str) -> String {
    // Run our compiler
    let mut child = Command::new("cargo")
        .args(["run", "--quiet"])
        .stdin(std::process::Stdio::piped())
        .stdout(std::process::Stdio::piped())
        .stderr(std::process::Stdio::piped())
        .spawn()
        .expect("failed to start compiler");

    child
        .stdin
        .as_mut()
        .unwrap()
        .write_all(minilisp.as_bytes())
        .unwrap();

    let output = child.wait_with_output().unwrap();
    assert!(
        output.status.success(),
        "compiler failed: {}",
        String::from_utf8_lossy(&output.stderr)
    );

    let c_code = String::from_utf8(output.stdout).unwrap();

    // Write C to a temp file
    let dir = tempfile::tempdir().unwrap();
    let c_path = dir.path().join("test.c");
    let bin_path = dir.path().join("test");
    std::fs::write(&c_path, &c_code).unwrap();

    // Compile C
    let cc = Command::new("cc")
        .args([
            c_path.to_str().unwrap(),
            "-o",
            bin_path.to_str().unwrap(),
        ])
        .output()
        .expect("failed to run cc");
    assert!(
        cc.status.success(),
        "cc failed: {}",
        String::from_utf8_lossy(&cc.stderr)
    );

    // Run the binary
    let run = Command::new(bin_path)
        .output()
        .expect("failed to run binary");

    String::from_utf8(run.stdout).unwrap()
}

#[test]
fn test_display_integer() {
    let output = compile_and_run("(display 42) (newline)");
    assert_eq!(output, "42\n");
}

#[test]
fn test_arithmetic() {
    let output = compile_and_run("(display (+ 3 (* 4 5))) (newline)");
    assert_eq!(output, "23\n");
}

#[test]
fn test_if_true() {
    let output = compile_and_run("(display (if #t 1 0)) (newline)");
    assert_eq!(output, "1\n");
}

#[test]
fn test_if_false() {
    let output = compile_and_run("(display (if #f 1 0)) (newline)");
    assert_eq!(output, "0\n");
}

#[test]
fn test_define_and_use() {
    let output = compile_and_run("(define x 10) (display x) (newline)");
    assert_eq!(output, "10\n");
}

#[test]
fn test_function_definition() {
    let output = compile_and_run(
        "(define (double x) (* x 2)) (display (double 21)) (newline)"
    );
    assert_eq!(output, "42\n");
}

#[test]
fn test_recursion() {
    let output = compile_and_run(
        "(define (fact n) (if (= n 0) 1 (* n (fact (- n 1))))) \
         (display (fact 10)) (newline)"
    );
    assert_eq!(output, "3628800\n");
}

#[test]
fn test_let_binding() {
    let output = compile_and_run(
        "(display (let ((x 10) (y 20)) (+ x y))) (newline)"
    );
    assert_eq!(output, "30\n");
}

#[test]
fn test_begin() {
    let output = compile_and_run(
        "(begin (display 1) (display 2) (display 3)) (newline)"
    );
    assert_eq!(output, "123\n");
}

#[test]
fn test_string_display() {
    let output = compile_and_run("(display \"hello world\") (newline)");
    assert_eq!(output, "hello world\n");
}

#[test]
fn test_boolean_display() {
    let output = compile_and_run("(display #t) (newline)");
    assert_eq!(output, "#t\n");
}

#[test]
fn test_comparison() {
    let output = compile_and_run(
        "(display (if (< 3 5) 1 0)) (newline)"
    );
    assert_eq!(output, "1\n");
}
}

Test corpus checklist

Ensure you have tests covering:

  • Integer literals (positive, negative, zero)
  • Boolean literals
  • String literals (with escape sequences)
  • Arithmetic operators (+, -, *, /)
  • Comparison operators (=, <, >, <=, >=)
  • Boolean operators (not, and, or)
  • display for each type
  • newline
  • define (variable)
  • define (function, including recursive)
  • if (both branches)
  • lambda
  • let bindings
  • begin sequencing
  • Nested expressions
  • Comments in source
  • Invalid programs (undefined symbols, malformed forms)

Running the tests

# All tests, unit and integration (integration requires cc on PATH)
cargo test

# Integration tests only
cargo test --test integration

18. What’s Next: Extensions and Further Reading

The compiler you have built is deliberately minimal — a solid foundation. This final section surveys the directions you can take it further: tail-call optimisation, closures and lambda lifting, a garbage collector, hygienic macros, a type system, an interactive REPL, and a self-hosting MiniLisp standard library. It closes with a curated reading list for going deeper into compiler theory and Lisp implementation.

Extensions

Tail-call optimisation. Recursive functions like fact currently grow the C call stack with every recursive call. In Lisp, tail calls are expected to run in constant space. You can implement this by detecting tail-position calls and compiling them as goto jumps or loops in the generated C.
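Note that fact as defined is not itself tail recursive — the multiplication happens after the recursive call returns — so it first needs an accumulator-passing rewrite. Once the call is in tail position, it compiles to parameter updates plus a jump. For illustration (in Rust rather than generated C), the loop a TCO'd factorial amounts to:

```rust
// The loop that tail-call optimisation would produce for an
// accumulator-passing factorial, written in Rust for illustration.
fn fact(mut n: i64) -> i64 {
    let mut acc: i64 = 1;
    loop {
        if n == 0 {
            return acc;
        }
        // The tail call fact(n - 1, acc * n) becomes parameter
        // updates followed by a jump back to the top of the loop.
        acc *= n;
        n -= 1;
    }
}

fn main() {
    assert_eq!(fact(10), 3628800); // constant stack space
}
```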

Closures and lambda lifting. Our compiler handles lambdas that do not capture variables from their enclosing scope. True closures — lambdas that reference variables from outer scopes — require either lambda lifting (passing free variables as extra parameters) or a closure data structure that packages the function pointer with its captured environment.
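A sketch of what lambda lifting produces, written directly in Rust for illustration: the free variable of (lambda (x) (+ x k)) becomes an extra parameter of a hypothetical lifted top-level function, and every call site passes the captured value explicitly.

```rust
// Hypothetical lifted form of (lambda (x) (+ x k)): the free
// variable k is threaded through as an explicit extra parameter.
fn lifted_lambda_0(x: i64, k: i64) -> i64 {
    x + k
}

fn main() {
    let k = 10; // the environment the lambda "captured"
    // Call sites pass the captured variable alongside the argument:
    assert_eq!(lifted_lambda_0(32, k), 42);
}
```

The alternative — a closure struct pairing a function pointer with its captured environment — is needed once closures escape their defining scope, e.g. when returned from a function.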

Garbage collection. MiniLisp allocates strings but never frees them. For long-running programs, you would add a garbage collector. A simple mark-and-sweep collector is a good starting point: maintain a list of all allocated objects, periodically mark the ones reachable from the stack and global variables, and free the rest.
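A toy version of the mark phase and a counting-only sweep, with table indices standing in for heap pointers. Obj, mark, and sweep are illustrative names, not part of the compiler; a real collector would discover roots on the C stack and actually free memory.

```rust
// Toy mark-and-sweep over a flat object table. "References" are
// just indices into the table, standing in for heap pointers.
struct Obj {
    marked: bool,
    refs: Vec<usize>, // indices of objects this one references
}

// Mark everything reachable from `root`.
fn mark(objs: &mut Vec<Obj>, root: usize) {
    if objs[root].marked {
        return; // already visited; cycles terminate here
    }
    objs[root].marked = true;
    let refs = objs[root].refs.clone();
    for r in refs {
        mark(objs, r);
    }
}

// Count the unreachable objects (a real sweep would free them),
// then clear all marks for the next collection cycle.
fn sweep(objs: &mut Vec<Obj>) -> usize {
    let unreachable = objs.iter().filter(|o| !o.marked).count();
    for o in objs.iter_mut() {
        o.marked = false;
    }
    unreachable
}

fn main() {
    let mut heap = vec![
        Obj { marked: false, refs: vec![1] }, // root, points at 1
        Obj { marked: false, refs: vec![] },  // reachable via root
        Obj { marked: false, refs: vec![] },  // garbage
    ];
    mark(&mut heap, 0);
    assert_eq!(sweep(&mut heap), 1); // only object 2 is unreachable
}
```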

Hygienic macros. Lisp macros transform code before it is evaluated. A macro system would let MiniLisp users define their own special forms. Hygienic macros (as in Scheme’s syntax-rules) ensure that macro-generated code does not accidentally capture user variables.

A type system. Adding type annotations and inference would catch errors like passing a string to + at compile time instead of generating C that segfaults. Hindley-Milner type inference is the classic starting point for functional languages.

An interactive REPL. A Read-Eval-Print Loop would let users type MiniLisp expressions and see results immediately. You could implement this by compiling each expression to C, linking it dynamically, and calling it — or by adding an interpreter mode alongside the compiler.

A standard library. Write a MiniLisp standard library in MiniLisp: list operations (map, filter, fold), higher-order functions, and utility procedures. This exercises the compiler and reveals any limitations in the language.

Self-hosting. The ultimate test of a compiler: rewrite the compiler itself in MiniLisp. This requires adding enough features to MiniLisp (file I/O, string manipulation, data structures) to make it practical, then translating the Rust code into MiniLisp.

Further reading

Compiler construction:

  • Compilers: Principles, Techniques, and Tools by Aho, Lam, Sethi, and Ullman (the “Dragon Book”) — the canonical compiler textbook.
  • Engineering a Compiler by Cooper and Torczon — a modern, practical alternative.
  • Crafting Interpreters by Robert Nystrom — an excellent, free online book that walks through building two complete interpreters.

Lisp implementation:

  • Structure and Interpretation of Computer Programs (SICP) by Abelson and Sussman — the classic computer science textbook, centred on Scheme.
  • Lisp in Small Pieces by Christian Queinnec — deep coverage of Lisp implementation strategies, from interpreters to compilers.
  • An Incremental Approach to Compiler Construction by Abdulaziz Ghuloum — a paper describing how to build a Scheme-to-x86 compiler incrementally, one feature at a time.

Rust and parsing:

  • The nom documentation and nom recipes — essential reference for parser combinators in Rust.
  • Programming Rust by Blandy, Orendorff, and Tindall — comprehensive coverage of Rust, including traits, generics, and the borrow checker.

Congratulations

You have built a working compiler from scratch. It reads MiniLisp, validates it, and emits C. You understand the core pipeline — parsing, semantic analysis, code generation — that underlies every compiler, from GCC to the Rust compiler itself. The techniques you have learned here transfer directly to building compilers, interpreters, static analysers, linters, and code formatters for any language. The next step is yours: pick an extension, implement it, and keep building.