Writing a Lisp-to-C Compiler in Rust
This course walks you through building a complete, working compiler from scratch. You will write every component yourself — a parser, a semantic analyser, and a code generator — ending with a program that reads MiniLisp source code and emits valid C. The compiler is written in Rust and uses the nom parser-combinator library for all parsing work; there is no separate hand-written lexer, because nom's combinators handle tokenisation and parsing in one pass.
Table of Contents
Part 1 — Foundations
- Introduction: What We’re Building
- MiniLisp Language Specification
- Compiler Architecture: The Pipeline
Part 2 — Parsing with nom
- Introduction to nom: Parser Combinators
- Setting Up the Project
- Recognizing Atoms: Integers, Booleans, Strings, Symbols
- The Abstract Syntax Tree
- Parsing Atoms with nom
- Parsing S-Expressions and Special Forms
Part 3 — Semantic Analysis
Part 4 — Code Generation
- The C Runtime Preamble
- Generating C: Atoms and Expressions
- Generating C: Definitions and Functions
- Generating C: Control Flow and Sequencing
Part 5 — Putting It Together
Part 1 — Foundations
1. Introduction: What We’re Building
A compiler is a program that transforms source code written in one language into equivalent code in another. By the end of this course you will have written one that accepts MiniLisp — a small, clean dialect of Lisp — and produces human-readable C that you can compile and run with any standard C compiler. Along the way you will implement each classic compiler stage from scratch: lexical analysis, parsing, semantic analysis, and code generation.
Why Lisp-to-C?
Lisp is the ideal source language for a first compiler because its syntax is almost trivial to parse. Every program is either an atom (a number, string, boolean, or name) or a list of expressions wrapped in parentheses. There are no operator-precedence rules, no statement-versus-expression distinctions, and no ambiguous grammars. This lets you focus on what a compiler does rather than fighting with syntax.
C is the ideal target language because it is close to the machine — you can see exactly how each Lisp construct translates to something concrete — yet still readable. You will not be writing assembly; the C compiler handles that final step.
What you will learn
By completing this course you will understand:
- Lexical analysis and parsing — how raw text becomes structured data.
- Abstract syntax trees — how compilers represent programs internally.
- Semantic analysis — how a compiler validates that a program makes sense before generating code.
- Code generation — how high-level constructs map to lower-level constructs in a target language.
- nom parser combinators — a practical, idiomatic Rust approach to parsing.
What you will build
The finished compiler is a single Rust binary. You pipe MiniLisp source in and get C source out:
$ echo '(define (square x) (* x x)) (display (square 5))' | minilisp-cc > out.c
$ cc -o out out.c && ./out
25
The compiler handles integers, booleans, strings, arithmetic, comparisons, conditionals, local bindings, sequencing, first-class named functions, and recursion. It is deliberately minimal — roughly 800 lines of Rust — so you can read and understand every line.
Prerequisites
You should be comfortable reading and writing Rust. You do not need prior experience with compilers, Lisp, or nom — this course introduces all three from scratch. You should have a working Rust toolchain (rustc, cargo) and a C compiler (cc or gcc) installed.
Course structure
The course is divided into five parts:
- Foundations — what MiniLisp looks like, what the compiler’s architecture is.
- Parsing with nom — turning source text into an AST.
- Semantic Analysis — validating the AST.
- Code Generation — turning the AST into C.
- Putting It Together — wiring the stages into a working tool and testing it.
Each section builds directly on the previous one. Code samples are cumulative: by the end, every snippet fits together into the complete compiler.
2. MiniLisp Language Specification
MiniLisp is the source language of our compiler. It is a minimal Lisp dialect with integers, booleans, strings, first-class functions, lexical scope, and a small set of built-in operators. This section defines every syntactic form precisely, gives the grammar in EBNF, and shows a complete example program so you know exactly what the compiler must handle before you write a single line of Rust.
Data types
MiniLisp has four data types:
| Type | Examples | Notes |
|---|---|---|
| Integer | 42, -7, 0 | Signed 64-bit integers |
| Boolean | #t, #f | True and false |
| String | "hello", "line\n" | Double-quoted, with \\, \", \n, \t escapes |
| Function | (lambda (x) (* x x)) | First-class, lexically scoped |
There are no floating-point numbers, characters, lists-as-data, or nil/null values. This keeps the compiler simple while still being expressive enough to write interesting programs.
Expressions
Everything in MiniLisp is an expression — there are no statements. Every form evaluates to a value.
Atoms are the simplest expressions:
42 ; integer
#t ; boolean true
#f ; boolean false
"hello" ; string
x ; symbol (variable reference)
Function calls use prefix notation inside parentheses. The first element is the operator or function; the rest are arguments:
(+ 1 2) ; => 3
(* 3 (+ 1 2)) ; => 9
(display "hi") ; prints "hi", returns void
Built-in operators:
| Operator | Arity | Description |
|---|---|---|
| +, -, *, / | 2 | Arithmetic (integer division for /) |
| =, <, >, <=, >= | 2 | Comparison (returns #t or #f) |
| not | 1 | Boolean negation |
| and, or | 2 | Boolean connectives (not short-circuiting) |
| display | 1 | Print a value to stdout |
| newline | 0 | Print a newline character |
Special forms
Special forms look like function calls but have special evaluation rules. The compiler recognises them by their leading keyword.
define — top-level definitions
;; Define a variable
(define pi 3)
;; Define a function (syntactic sugar)
(define (square x) (* x x))
;; The function form is equivalent to:
(define square (lambda (x) (* x x)))
if — conditional
(if (> x 0) x (- 0 x)) ; absolute value
Always three sub-expressions: condition, then-branch, else-branch.
lambda — anonymous function
(lambda (x y) (+ x y))
Parameter list followed by one or more body expressions. The value of the last body expression is the return value.
let — local bindings
(let ((x 10)
      (y 20))
  (+ x y))
A list of (name value) bindings followed by one or more body expressions. Bindings are not mutually visible (this is let, not let*).
begin — sequencing
(begin
  (display "hello")
  (newline)
  42)
Evaluates each expression in order, returns the value of the last one.
Comments
A semicolon ; starts a comment that extends to the end of the line:
(define x 10) ; this is a comment
Grammar (EBNF)
program = { expr } ;
expr = atom | list ;
atom = integer | boolean | string | symbol ;
integer = [ "-" ] digit { digit } ;
boolean = "#t" | "#f" ;
string = '"' { string_char } '"' ;
string_char = escape | any character except '"' and '\\' ;
escape = '\\' ( '"' | '\\' | 'n' | 't' ) ;
symbol = symbol_start { symbol_cont } ;
symbol_start = letter | '_' | '+' | '-' | '*' | '/' | '='
               | '<' | '>' | '!' | '?' ;
symbol_cont = symbol_start | digit ;
list = '(' { ws } { expr { ws } } ')' ;
ws = ' ' | '\t' | '\n' | '\r' | comment ;
comment = ';' { any character except '\n' } '\n' ;
Complete example program
;; Recursive factorial
(define (fact n)
  (if (= n 0)
      1
      (* n (fact (- n 1)))))
;; Compute and display 10!
(display (fact 10))
(newline)
This program defines a recursive factorial function, computes 10!, and prints 3628800 followed by a newline.
3. Compiler Architecture: The Pipeline
Our compiler is a classic multi-stage pipeline: source text passes through a parser, producing an AST; the AST passes through a semantic analyser, which validates scope and form usage; the validated AST passes through a code generator, which emits C. This section maps that pipeline onto the module structure you will build and explains how data and errors flow between stages.
The four stages
MiniLisp source text
       │
       ▼
┌──────────────┐
│    Parser    │  (nom combinators)
│  src/parser  │
└──────┬───────┘
       │ Vec<Expr>
       ▼
┌──────────────┐
│   Analyser   │  (scope checking, form validation)
│ src/analysis │
└──────┬───────┘
       │ Vec<Expr> (same type, now validated)
       ▼
┌──────────────┐
│   Code Gen   │  (AST → C strings)
│ src/codegen  │
└──────┬───────┘
       │ String
       ▼
C source code
- Parser — converts raw text into a Vec<Expr>, where Expr is the AST node type. Uses nom parser combinators. Reports syntax errors with position information.
- Semantic Analyser — walks the AST, builds a symbol table to track variable scopes, and validates that special forms have the correct shape (e.g. if has exactly three sub-expressions). Reports semantic errors. The analyser does not transform the AST — it only checks it.
- Code Generator — recursively traverses the validated AST and produces a String containing the complete C program. Emits the runtime preamble first, then forward declarations, then function definitions, then main.
- Driver — the main function wires the stages together: read input, parse, analyse, generate, write output. It also handles CLI arguments and error display.
Module layout
minilisp-cc/
├── Cargo.toml
└── src/
├── main.rs # CLI entry point, pipeline driver
├── ast.rs # Expr enum and Display impl
├── parser.rs # nom parsers
├── analysis.rs # symbol table, scope checking, form validation
└── codegen.rs # C code generation
Each module has a single, clear responsibility. Data flows in one direction: main calls parser, passes the result to analysis, passes that to codegen, and writes the output.
Data flow
The key data type is Expr — the AST node. It is defined once in ast.rs and used by all stages. The parser produces Vec<Expr>. The analyser consumes Vec<Expr> and returns Vec<Expr> (unchanged, but validated). The code generator consumes Vec<Expr> and returns String.
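To make that flow concrete, here is a self-contained sketch of the pipeline shape. The stages are toy stand-ins (a "parser" for whitespace-separated integers, a no-op analyser, a comment-emitting generator); only the signatures mirror the real modules built in later sections.

```rust
// Toy stand-ins for the real modules; only the type shapes match:
// parse: &str -> Vec<Expr>, analyse: &[Expr] -> (), generate: &[Expr] -> String.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Int(i64),
}

fn parse(src: &str) -> Result<Vec<Expr>, String> {
    src.split_whitespace()
        .map(|tok| tok.parse::<i64>().map(Expr::Int).map_err(|e| e.to_string()))
        .collect()
}

fn analyse(exprs: &[Expr]) -> Result<(), Vec<String>> {
    let _ = exprs; // nothing to check in this toy language
    Ok(())
}

fn generate(exprs: &[Expr]) -> String {
    format!("/* {} top-level expressions */\n", exprs.len())
}

fn main() {
    let ast = parse("1 2 3").expect("syntax error"); // Vec<Expr>
    analyse(&ast).expect("semantic error");          // same Vec, now validated
    let c_source = generate(&ast);                   // String
    print!("{}", c_source);
}
```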
Error handling
Each stage defines its own error type:
#![allow(unused)]
fn main() {
/// A compiler error with a human-readable message.
#[derive(Debug)]
pub struct CompileError {
    pub message: String,
}
}
In a production compiler you would track source positions (line and column) in every AST node. For simplicity, our compiler attaches position information only in error messages from the parser (nom provides this automatically) and uses descriptive messages for semantic errors.
The driver collects errors from each stage. If parsing fails, the compiler stops. If analysis finds errors, it reports all of them before stopping (so you see every problem at once, not one at a time). Code generation assumes a valid AST and does not produce errors.
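A minimal sketch of that reporting policy, using a hypothetical format_errors helper (the real driver arrives in Part 5):

```rust
// Hypothetical helper: the analyser returns every problem it found, and the
// driver formats the whole batch at once instead of stopping at the first.
fn format_errors(errors: &[String]) -> String {
    errors
        .iter()
        .map(|e| format!("error: {}", e))
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    // Simulated analysis result with two independent problems:
    let errs = vec![
        "undefined variable y".to_string(),
        "if expects 3 sub-expressions, got 2".to_string(),
    ];
    let report = format_errors(&errs);
    assert_eq!(report.lines().count(), 2); // both problems reported together
    eprintln!("{}", report);
}
```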
Why this architecture?
Separating the compiler into independent stages has practical benefits:
- Testability — you can unit-test each stage in isolation. Feed a string to the parser and assert on the AST. Feed an AST to the analyser and assert on the errors. Feed a valid AST to the code generator and assert on the C output.
- Debuggability — when something goes wrong, you know which stage to look at.
- Extensibility — adding a new language feature means adding an AST variant, a parser case, an analyser check, and a code generation case. Each change is localised to one module.
Part 2 — Parsing with nom
4. Introduction to nom: Parser Combinators
nom is a parser-combinator library: instead of writing a grammar file and running a generator, you write small Rust functions that each recognise a fragment of input, then combine them into larger parsers. This section introduces the core IResult<I, O, E> type, walks through the essential combinators (tag, char, alt, many0, map, tuple, delimited, preceded), and shows how to write, compose, and test parsers before you apply any of this to MiniLisp.
What is a parser combinator?
A parser is a function that takes some input and either succeeds (returning the parsed value and the remaining input) or fails (returning an error). A combinator is a function that takes one or more parsers and returns a new parser. By combining small parsers you build complex ones without writing a single regular expression or grammar file.
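The idea fits in a few lines of plain Rust before nom enters the picture. This is a sketch of the concept, not nom's actual types:

```rust
// A hand-rolled parser: consume one leading digit, return (rest, value).
fn parse_digit(input: &str) -> Result<(&str, u32), String> {
    match input.chars().next() {
        Some(c) if c.is_ascii_digit() => Ok((&input[1..], c.to_digit(10).unwrap())),
        _ => Err(format!("expected digit at {:?}", input)),
    }
}

// A hand-rolled combinator: take a parser, return a parser that runs it twice.
fn twice<T>(
    p: impl Fn(&str) -> Result<(&str, T), String>,
) -> impl Fn(&str) -> Result<(&str, (T, T)), String> {
    move |input| {
        let (rest, a) = p(input)?;
        let (rest, b) = p(rest)?;
        Ok((rest, (a, b)))
    }
}

fn main() {
    assert_eq!(parse_digit("4x"), Ok(("x", 4)));
    let two_digits = twice(parse_digit);
    assert_eq!(two_digits("42rest"), Ok(("rest", (4, 2))));
}
```

nom's combinators have the same shape, with a richer error type and many more building blocks.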
In nom, every parser has this signature:
#![allow(unused)]
fn main() {
fn parser(input: &str) -> IResult<&str, Output>
}
IResult<I, O> is defined roughly as:
#![allow(unused)]
fn main() {
type IResult<I, O> = Result<(I, O), Err<Error<I>>>;
}
On success, you get a tuple of (remaining_input, parsed_value). On failure, you get an error indicating where parsing failed.
Essential combinators
Here are the combinators you will use most. Each is a function in the nom crate.
tag — match a literal string
#![allow(unused)]
fn main() {
use nom::IResult;
use nom::bytes::complete::tag;
fn parse_hello(input: &str) -> IResult<&str, &str> {
tag("hello")(input)
}
assert_eq!(parse_hello("hello world"), Ok((" world", "hello")));
}
char — match a single character
#![allow(unused)]
fn main() {
use nom::IResult;
use nom::character::complete::char;
// Matches the character '(' (the annotation pins down nom's error type)
let result: IResult<&str, char> = char('(')("(abc)");
assert_eq!(result, Ok(("abc)", '(')));
}
alt — try alternatives in order
#![allow(unused)]
fn main() {
use nom::IResult;
use nom::branch::alt;
use nom::bytes::complete::tag;
fn parse_bool(input: &str) -> IResult<&str, &str> {
alt((tag("#t"), tag("#f")))(input)
}
assert_eq!(parse_bool("#t rest"), Ok((" rest", "#t")));
assert_eq!(parse_bool("#f rest"), Ok((" rest", "#f")));
}
alt tries each parser in order and returns the result of the first one that succeeds.
map — transform a parser’s output
#![allow(unused)]
fn main() {
use nom::IResult;
use nom::combinator::map;
use nom::bytes::complete::tag;
fn parse_true(input: &str) -> IResult<&str, bool> {
map(tag("#t"), |_| true)(input)
}
assert_eq!(parse_true("#t"), Ok(("", true)));
}
tuple — sequence multiple parsers
#![allow(unused)]
fn main() {
use nom::IResult;
use nom::sequence::tuple;
use nom::character::complete::char;
use nom::bytes::complete::tag;
fn parse_pair(input: &str) -> IResult<&str, (char, &str, char)> {
tuple((char('('), tag("hi"), char(')')))(input)
}
assert_eq!(parse_pair("(hi)"), Ok(("", ('(', "hi", ')'))));
}
delimited — match something between two delimiters
#![allow(unused)]
fn main() {
use nom::IResult;
use nom::sequence::delimited;
use nom::character::complete::char;
use nom::bytes::complete::tag;
fn parse_parens(input: &str) -> IResult<&str, &str> {
delimited(char('('), tag("hi"), char(')'))(input)
}
assert_eq!(parse_parens("(hi)"), Ok(("", "hi")));
}
delimited discards the opening and closing delimiters and returns only the inner value.
preceded — match a prefix and discard it
#![allow(unused)]
fn main() {
use nom::IResult;
use nom::sequence::preceded;
use nom::character::complete::char;
use nom::bytes::complete::tag;
fn parse_hash_t(input: &str) -> IResult<&str, &str> {
preceded(char('#'), tag("t"))(input)
}
assert_eq!(parse_hash_t("#t"), Ok(("", "t")));
}
many0 — match zero or more repetitions
#![allow(unused)]
fn main() {
use nom::IResult;
use nom::multi::many0;
use nom::character::complete::char;
fn parse_stars(input: &str) -> IResult<&str, Vec<char>> {
many0(char('*'))(input)
}
assert_eq!(parse_stars("***abc"), Ok(("abc", vec!['*', '*', '*'])));
assert_eq!(parse_stars("abc"), Ok(("abc", vec![])));
}
Composing parsers
The power of combinators is composition. Here is a parser that recognises a parenthesised, comma-separated pair of integers:
#![allow(unused)]
fn main() {
use nom::IResult;
use nom::character::complete::{char, digit1};
use nom::combinator::map_res;
use nom::sequence::{delimited, separated_pair};
fn parse_int(input: &str) -> IResult<&str, i64> {
map_res(digit1, |s: &str| s.parse::<i64>())(input)
}
fn parse_pair(input: &str) -> IResult<&str, (i64, i64)> {
delimited(
char('('),
separated_pair(parse_int, char(','), parse_int),
char(')'),
)(input)
}
assert_eq!(parse_pair("(3,7)"), Ok(("", (3, 7))));
}
Each piece is a small, testable function. You build up complexity by snapping pieces together.
Testing nom parsers
Testing a nom parser is straightforward: call the function with a test input and assert on the result.
#![allow(unused)]
fn main() {
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_parse_int() {
assert_eq!(parse_int("42 rest"), Ok((" rest", 42)));
}
#[test]
fn test_parse_int_fails_on_alpha() {
assert!(parse_int("abc").is_err());
}
}
}
Always test both the happy path and failure cases. Verify that the remaining input is what you expect — a parser that accidentally consumes too much or too little is a common bug.
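Here is that bug class illustrated with a hand-rolled digit parser (no nom, so the example stays self-contained): asserting only on the parsed value would miss a version that also swallowed the following space.

```rust
// Take the leading run of digits, returning (remaining_input, digits).
fn take_digits(input: &str) -> Result<(&str, &str), String> {
    let end = input
        .find(|c: char| !c.is_ascii_digit())
        .unwrap_or(input.len());
    if end == 0 {
        return Err("expected digit".to_string());
    }
    Ok((&input[end..], &input[..end]))
}

fn main() {
    let (rest, digits) = take_digits("42 rest").unwrap();
    assert_eq!(digits, "42");
    assert_eq!(rest, " rest"); // the space must NOT have been consumed
}
```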
Exercises
- Write a parser that recognises the string "true" and returns the Rust bool value true.
- Write a parser that recognises either "yes" or "no" and returns a bool.
- Write a parser for a parenthesised integer like (42) that returns the i64 value inside.
5. Setting Up the Project
You will create a new Rust binary crate for the compiler, add nom and any other dependencies to Cargo.toml, and lay out the module structure that the rest of the course fills in. By the end of this section you will have a project that compiles, a src/main.rs that reads from stdin, and placeholder modules for each compiler stage.
Create the crate
cargo new minilisp-cc
cd minilisp-cc
This gives you a standard Rust binary project with src/main.rs and Cargo.toml.
Add dependencies
Edit Cargo.toml:
[package]
name = "minilisp-cc"
version = "0.1.0"
edition = "2021"
[dependencies]
nom = "7"
[profile.release]
opt-level = "z"
lto = true
strip = true
codegen-units = 1
This course targets nom version 7. The release profile settings minimise binary size.
Create the module files
Create one file per compiler stage:
touch src/ast.rs src/parser.rs src/analysis.rs src/codegen.rs
Your src/ directory should now look like this:
src/
├── main.rs
├── ast.rs
├── parser.rs
├── analysis.rs
└── codegen.rs
Wire up modules in main.rs
Replace src/main.rs with:
mod ast;
mod parser;
mod analysis;
mod codegen;
use std::io::{self, Read};
fn main() {
    // Read all of stdin into a string
    let mut input = String::new();
    io::stdin()
        .read_to_string(&mut input)
        .expect("failed to read stdin");
    // TODO: parse, analyse, generate
    println!("Read {} bytes", input.len());
}
Add placeholder content to each module
In src/ast.rs:
#![allow(unused)]
fn main() {
/// A MiniLisp expression — the core AST node type.
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
// Variants will be added in §7
}
}
In src/parser.rs:
#![allow(unused)]
fn main() {
use crate::ast::Expr;
/// Parse MiniLisp source code into a list of expressions.
pub fn parse(_input: &str) -> Result<Vec<Expr>, String> {
todo!("implement in §8–§9")
}
}
In src/analysis.rs:
#![allow(unused)]
fn main() {
use crate::ast::Expr;
/// Validate a list of parsed expressions.
pub fn analyse(_exprs: &[Expr]) -> Result<(), Vec<String>> {
todo!("implement in §10–§11")
}
}
In src/codegen.rs:
#![allow(unused)]
fn main() {
use crate::ast::Expr;
/// Generate C source code from a list of validated expressions.
pub fn generate(_exprs: &[Expr]) -> String {
todo!("implement in §12–§15")
}
}
Verify it compiles
cargo check
You should see a clean compilation with no errors. You may see dead-code warnings for functions that nothing calls yet — that is fine. They will disappear as you fill in the implementations and wire them into main.
Checkpoint
You now have a project skeleton with:
- A binary crate that reads from stdin.
- Four modules with clear responsibilities and placeholder signatures.
- nom as a dependency, ready to use.
The rest of the course fills in these modules one at a time.
6. Recognizing Atoms: Integers, Booleans, Strings, Symbols
Before building the full parser, you need nom parsers for each atomic value in MiniLisp: signed integers, boolean literals #t and #f, double-quoted strings with escape sequences, and symbol identifiers. This section develops each atom parser in isolation, explains the nom combinators used, and provides exercises to test your understanding before the parts are assembled into the full parser.
Parsing integers
A MiniLisp integer is an optional minus sign followed by one or more digits. We use recognize to capture the matched text as a single slice, then parse it into an i64:
#![allow(unused)]
fn main() {
use nom::IResult;
use nom::branch::alt;
use nom::character::complete::{char, digit1};
use nom::combinator::{map_res, opt, recognize};
use nom::sequence::pair;
/// Parse a signed integer literal.
fn parse_integer(input: &str) -> IResult<&str, i64> {
map_res(
recognize(pair(opt(char('-')), digit1)),
|s: &str| s.parse::<i64>(),
)(input)
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn positive_integer() {
assert_eq!(parse_integer("42 rest"), Ok((" rest", 42)));
}
#[test]
fn negative_integer() {
assert_eq!(parse_integer("-7"), Ok(("", -7)));
}
#[test]
fn zero() {
assert_eq!(parse_integer("0"), Ok(("", 0)));
}
}
}
How it works:
- opt(char('-')) optionally matches a leading minus sign.
- digit1 matches one or more ASCII digits.
- pair sequences them so we match -42 as a unit.
- recognize captures the entire matched span as a &str instead of the pair tuple.
- map_res applies str::parse::<i64>() to convert the string to an integer, propagating parse errors as nom errors.
Parsing booleans
Booleans are the literals #t (true) and #f (false):
#![allow(unused)]
fn main() {
use nom::bytes::complete::tag;
use nom::combinator::map;
/// Parse a boolean literal: `#t` or `#f`.
fn parse_boolean(input: &str) -> IResult<&str, bool> {
alt((
map(tag("#t"), |_| true),
map(tag("#f"), |_| false),
))(input)
}
#[test]
fn booleans() {
assert_eq!(parse_boolean("#t"), Ok(("", true)));
assert_eq!(parse_boolean("#f rest"), Ok((" rest", false)));
}
}
How it works:
- tag("#t") matches the literal string #t.
- map transforms the matched string into the Rust bool value true.
- alt tries #t first, then #f.
Parsing strings
Strings are enclosed in double quotes and support escape sequences \\, \", \n, and \t:
#![allow(unused)]
fn main() {
use nom::bytes::complete::tag;
use nom::character::complete::char;
use nom::multi::many0;
use nom::sequence::delimited;
/// Parse a single character inside a string (handling escapes).
fn string_char(input: &str) -> IResult<&str, char> {
alt((
// Escape sequences
map(tag("\\\\"), |_| '\\'),
map(tag("\\\""), |_| '"'),
map(tag("\\n"), |_| '\n'),
map(tag("\\t"), |_| '\t'),
// Any character that is not a quote or backslash
nom::character::complete::none_of("\"\\"),
))(input)
}
/// Parse a double-quoted string literal.
fn parse_string(input: &str) -> IResult<&str, String> {
delimited(
char('"'),
map(many0(string_char), |chars| chars.into_iter().collect()),
char('"'),
)(input)
}
#[test]
fn simple_string() {
assert_eq!(parse_string("\"hello\""), Ok(("", "hello".to_string())));
}
#[test]
fn string_with_escapes() {
assert_eq!(
parse_string("\"line\\none\""),
Ok(("", "line\none".to_string()))
);
}
#[test]
fn empty_string() {
assert_eq!(parse_string("\"\""), Ok(("", String::new())));
}
}
How it works:
- string_char matches either an escape sequence or a literal character.
- The escape branches handle \\, \", \n, and \t. The none_of("\"\\") branch matches any other character: it rejects the quote (which terminates the string) and the backslash (which must start an escape), so an unknown escape fails to parse rather than slipping through.
- many0(string_char) collects zero or more characters into a Vec<char>.
- map converts the Vec<char> into a String.
- delimited strips the surrounding double quotes.
Parsing symbols
A symbol is an identifier that starts with a letter, underscore, or one of the operator characters (+, -, *, /, =, <, >, !, ?), followed by zero or more of those characters or digits:
#![allow(unused)]
fn main() {
use nom::combinator::recognize;
use nom::multi::many0_count;
use nom::character::complete::one_of;
/// Match a character that can start a symbol.
fn symbol_start(input: &str) -> IResult<&str, char> {
    alt((
        nom::character::complete::satisfy(|c: char| c.is_ascii_alphabetic()),
        one_of("_+-*/=<>!?"),
    ))(input)
}
/// Parse a symbol identifier.
fn parse_symbol(input: &str) -> IResult<&str, String> {
    map(
        recognize(pair(
            one_of("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_+-*/=<>!?"),
            many0_count(one_of(
                "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_+-*/=<>!?",
            )),
        )),
        String::from,
    )(input)
}
#[test]
fn simple_symbol() {
assert_eq!(parse_symbol("foo"), Ok(("", "foo".to_string())));
}
#[test]
fn operator_symbol() {
assert_eq!(parse_symbol("+ rest"), Ok((" rest", "+".to_string())));
}
#[test]
fn hyphenated_symbol() {
assert_eq!(
parse_symbol("my-var"),
Ok(("", "my-var".to_string()))
);
}
}
How it works:
- one_of(...) matches a single character from the given set.
- The first one_of ensures the symbol starts with a valid leading character (not a digit).
- many0_count matches zero or more continuation characters but only counts them (we use recognize to capture the whole span).
- recognize gives us the entire matched text as a &str, which we convert to a String.
Important ordering note
When combining these parsers later, the order matters. parse_integer and parse_symbol can both start with -. The rule is: try parse_integer first, and only fall back to parse_symbol if it fails. We handle this with alt in the next section.
Exercises
- Extend parse_string to support the escape sequence \0 (null character).
- Write tests for edge cases: the symbol - (just a minus sign, a valid symbol), the integer -0, and the string "\\\"" (a string containing a backslash followed by a quote).
- What happens if you try parse_integer on the input "abc"? Read the nom error and explain what it means.
7. The Abstract Syntax Tree
The parser’s output is an Abstract Syntax Tree — a Rust data structure that captures the meaning of a MiniLisp program without the syntactic noise of parentheses and whitespace. This section defines the Expr enum and its variants, discusses why the tree is structured the way it is, and implements Display so you can inspect parse results during development.
What is an AST?
An Abstract Syntax Tree strips away syntactic details — parentheses, whitespace, comments — and represents only the structural meaning of a program. Two MiniLisp programs that differ only in whitespace or comments produce identical ASTs. The AST is the compiler’s internal representation from the parser onwards; every subsequent stage operates on it.
Designing the Expr enum
There are two common approaches for representing Lisp in an AST:
Option A — Generic lists. Everything is either an atom or a List(Vec<Expr>). Special forms like if and define are not distinguished during parsing; they are recognised later during semantic analysis.
Option B — Specific variants. Each special form gets its own enum variant with typed fields. The parser does more work, but later stages deal with well-typed data.
We use Option B. The benefit is exhaustive pattern matching: if you add a new variant, the Rust compiler forces you to handle it in the analyser and code generator. Bugs become compile errors instead of runtime surprises.
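A two-variant toy subset is enough to see the payoff; this sketch is standalone and much smaller than the real Expr:

```rust
// Cut-down subset of the AST, just to demonstrate exhaustive matching.
#[derive(Debug)]
enum Expr {
    Int(i64),
    If(Box<Expr>, Box<Expr>, Box<Expr>),
}

fn depth(e: &Expr) -> usize {
    match e {
        Expr::Int(_) => 1,
        // Delete this arm (or add a new variant without handling it here)
        // and rustc rejects the program with "non-exhaustive patterns".
        Expr::If(c, t, f) => 1 + depth(c).max(depth(t)).max(depth(f)),
    }
}

fn main() {
    let e = Expr::If(
        Box::new(Expr::Int(1)),
        Box::new(Expr::Int(2)),
        Box::new(Expr::Int(3)),
    );
    assert_eq!(depth(&e), 2);
}
```

With Option A's generic lists, a forgotten case would surface only at runtime, on the program that happens to exercise it.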
The complete Expr enum
Add this to src/ast.rs:
#![allow(unused)]
fn main() {
/// A MiniLisp expression — the core AST node type.
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    /// Integer literal: `42`, `-7`
    Int(i64),
    /// Boolean literal: `#t`, `#f`
    Bool(bool),
    /// String literal: `"hello"`
    Str(String),
    /// Symbol (variable or function name): `x`, `+`, `my-func`
    Symbol(String),
    /// Function call: `(f arg1 arg2 ...)`
    /// The first element is the function; the rest are arguments.
    Call(Box<Expr>, Vec<Expr>),
    /// Variable definition: `(define name value)`
    Define(String, Box<Expr>),
    /// Function definition: `(define (name params...) body...)`
    DefineFunc(String, Vec<String>, Vec<Expr>),
    /// Conditional: `(if condition then else)`
    If(Box<Expr>, Box<Expr>, Box<Expr>),
    /// Anonymous function: `(lambda (params...) body...)`
    Lambda(Vec<String>, Vec<Expr>),
    /// Local bindings: `(let ((name val) ...) body...)`
    Let(Vec<(String, Expr)>, Vec<Expr>),
    /// Sequencing: `(begin expr1 expr2 ...)`
    Begin(Vec<Expr>),
}
}
Design notes:
- Box<Expr> is used wherever a single sub-expression is required. Without Box, the enum would be infinitely sized because Expr contains Expr.
- Vec<Expr> is used for variable-length lists of sub-expressions (function arguments, body expressions).
- Define and DefineFunc are separate variants because they have different structures. (define x 10) binds a name to a value; (define (f x) ...) binds a name to a function with parameters and a body.
- Call stores the function expression (which could be a symbol, a lambda, or even another call) as a Box<Expr>, and the arguments as a Vec<Expr>.
How MiniLisp maps to the AST
| MiniLisp | AST |
|---|---|
| 42 | Expr::Int(42) |
| #t | Expr::Bool(true) |
| "hi" | Expr::Str("hi".into()) |
| x | Expr::Symbol("x".into()) |
| (+ 1 2) | Expr::Call(Box::new(Expr::Symbol("+".into())), vec![Expr::Int(1), Expr::Int(2)]) |
| (define x 10) | Expr::Define("x".into(), Box::new(Expr::Int(10))) |
| (if #t 1 0) | Expr::If(Box::new(Expr::Bool(true)), Box::new(Expr::Int(1)), Box::new(Expr::Int(0))) |
Implementing Display
A Display implementation lets you print ASTs in a readable Lisp-like format, which is invaluable during debugging:
#![allow(unused)]
fn main() {
use std::fmt;
impl fmt::Display for Expr {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Expr::Int(n) => write!(f, "{}", n),
            Expr::Bool(b) => write!(f, "{}", if *b { "#t" } else { "#f" }),
            Expr::Str(s) => write!(f, "\"{}\"", s),
            Expr::Symbol(s) => write!(f, "{}", s),
            Expr::Call(func, args) => {
                write!(f, "({}", func)?;
                for arg in args {
                    write!(f, " {}", arg)?;
                }
                write!(f, ")")
            }
            Expr::Define(name, val) => write!(f, "(define {} {})", name, val),
            Expr::DefineFunc(name, params, body) => {
                write!(f, "(define ({}", name)?;
                for p in params {
                    write!(f, " {}", p)?;
                }
                write!(f, ")")?;
                for expr in body {
                    write!(f, " {}", expr)?;
                }
                write!(f, ")")
            }
            Expr::If(cond, then, els) => {
                write!(f, "(if {} {} {})", cond, then, els)
            }
            Expr::Lambda(params, body) => {
                write!(f, "(lambda (")?;
                for (i, p) in params.iter().enumerate() {
                    if i > 0 { write!(f, " ")?; }
                    write!(f, "{}", p)?;
                }
                write!(f, ")")?;
                for expr in body {
                    write!(f, " {}", expr)?;
                }
                write!(f, ")")
            }
            Expr::Let(bindings, body) => {
                write!(f, "(let (")?;
                for (i, (name, val)) in bindings.iter().enumerate() {
                    if i > 0 { write!(f, " ")?; }
                    write!(f, "({} {})", name, val)?;
                }
                write!(f, ")")?;
                for expr in body {
                    write!(f, " {}", expr)?;
                }
                write!(f, ")")
            }
            Expr::Begin(exprs) => {
                write!(f, "(begin")?;
                for expr in exprs {
                    write!(f, " {}", expr)?;
                }
                write!(f, ")")
            }
        }
    }
}
}
You can now println!("{}", expr) any AST node and get readable output.
8. Parsing Atoms with nom
With atom parsers and the AST defined, this section assembles them into a single parse_atom function that recognises any MiniLisp atom and returns the corresponding Expr variant. You will use alt to try each alternative in turn, learn how nom reports errors and how to interpret them, and write unit tests that verify correct parsing of every atom type.
Connecting parsers to the AST
In §6 we wrote standalone parsers that returned Rust primitives (i64, bool, String). Now we wrap each one with map to produce Expr variants.
Add these to src/parser.rs:
#![allow(unused)]
fn main() {
use nom::IResult;
use nom::branch::alt;
use nom::bytes::complete::tag;
use nom::character::complete::{char, digit1, none_of, one_of};
use nom::combinator::{map, map_res, opt, recognize};
use nom::multi::{many0, many0_count};
use nom::sequence::{delimited, pair};
use crate::ast::Expr;
/// Parse an integer literal and return `Expr::Int`.
fn parse_int(input: &str) -> IResult<&str, Expr> {
map(
map_res(
recognize(pair(opt(char('-')), digit1)),
|s: &str| s.parse::<i64>(),
),
Expr::Int,
)(input)
}
/// Parse a boolean literal and return `Expr::Bool`.
fn parse_bool(input: &str) -> IResult<&str, Expr> {
alt((
map(tag("#t"), |_| Expr::Bool(true)),
map(tag("#f"), |_| Expr::Bool(false)),
))(input)
}
/// Parse a string literal and return `Expr::Str`.
fn parse_str(input: &str) -> IResult<&str, Expr> {
map(
delimited(
char('"'),
map(many0(string_char), |chars| {
chars.into_iter().collect::<String>()
}),
char('"'),
),
Expr::Str,
)(input)
}
fn string_char(input: &str) -> IResult<&str, char> {
alt((
map(tag("\\\\"), |_| '\\'),
map(tag("\\\""), |_| '"'),
map(tag("\\n"), |_| '\n'),
map(tag("\\t"), |_| '\t'),
none_of("\"\\"),
))(input)
}
/// Parse a symbol and return `Expr::Symbol`.
fn parse_symbol(input: &str) -> IResult<&str, Expr> {
map(
map(
recognize(pair(
one_of(
"abcdefghijklmnopqrstuvwxyz\
ABCDEFGHIJKLMNOPQRSTUVWXYZ\
_+-*/=<>!?",
),
many0_count(one_of(
"abcdefghijklmnopqrstuvwxyz\
ABCDEFGHIJKLMNOPQRSTUVWXYZ\
0123456789_+-*/=<>!?",
)),
)),
String::from,
),
Expr::Symbol,
)(input)
}
}
Combining atoms with alt
The parse_atom function tries each atom parser in order:
#![allow(unused)]
fn main() {
/// Parse any atom: integer, boolean, string, or symbol.
pub fn parse_atom(input: &str) -> IResult<&str, Expr> {
alt((parse_bool, parse_str, parse_int, parse_symbol))(input)
}
}
Ordering matters. Booleans are tried first because #t and #f start with #, which no other atom type uses. Strings are next because they start with ". Integers come before symbols because a bare - is a valid symbol — we want -42 to parse as an integer, not as the symbol - followed by 42. If the integer parser fails (no digits after the minus), nom backtracks and the symbol parser gets a chance.
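The backtracking behaviour can be sketched without nom at all — a hand-rolled version, assuming tiny parsers that return `Option<(rest, value)>` instead of `IResult` (all names here are illustrative, not part of the compiler):

```rust
// Hand-rolled sketch of alt's ordering: try the integer parser first,
// fall back to the symbol parser when it fails.

#[derive(Debug, PartialEq)]
enum Atom {
    Int(i64),
    Symbol(String),
}

/// Try to parse a (possibly negative) integer from the front of the input.
fn try_int(input: &str) -> Option<(&str, Atom)> {
    let rest = input.strip_prefix('-').unwrap_or(input);
    let digits: String = rest.chars().take_while(|c| c.is_ascii_digit()).collect();
    if digits.is_empty() {
        return None; // no digits after the sign: "backtrack"
    }
    let consumed = input.len() - rest.len() + digits.len();
    Some((&input[consumed..], Atom::Int(input[..consumed].parse().unwrap())))
}

/// Fallback: a symbol made of identifier characters.
fn try_symbol(input: &str) -> Option<(&str, Atom)> {
    let sym: String = input
        .chars()
        .take_while(|c| c.is_alphanumeric() || "_+-*/=<>!?".contains(*c))
        .collect();
    if sym.is_empty() {
        None
    } else {
        Some((&input[sym.len()..], Atom::Symbol(sym)))
    }
}

/// The equivalent of alt((parse_int, parse_symbol)): first success wins.
fn parse_atom_sketch(input: &str) -> Option<(&str, Atom)> {
    try_int(input).or_else(|| try_symbol(input))
}

fn main() {
    assert_eq!(parse_atom_sketch("-42"), Some(("", Atom::Int(-42))));
    // A bare "-" fails the integer parser, so the symbol parser gets its turn.
    assert_eq!(
        parse_atom_sketch("- rest"),
        Some((" rest", Atom::Symbol("-".to_string())))
    );
    println!("ordering sketch ok");
}
```

Swapping the two alternatives in `parse_atom_sketch` would make `-42` come back as the symbol `-42` — the same failure mode the `alt` ordering in the real parser is designed to avoid.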
Understanding nom errors
When all alternatives in alt fail, nom returns an error pointing to the position where parsing failed. For example:
#![allow(unused)]
fn main() {
let result = parse_atom("@invalid");
// Err(Err::Error(Error { input: "@invalid", code: ErrorKind::OneOf }))
}
The error says: at position @invalid, the OneOf combinator (inside parse_symbol) could not match any expected character. This is the last alternative that was tried.
During development, you can use nom::error::VerboseError in place of the default error type to record a trace of which combinators were attempted before the failure. This is helpful for debugging deep alt stacks but adds overhead, so switch back to the default error type once the parser works.
Unit tests
#![allow(unused)]
fn main() {
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn atom_integer() {
assert_eq!(parse_atom("42"), Ok(("", Expr::Int(42))));
assert_eq!(parse_atom("-7 rest"), Ok((" rest", Expr::Int(-7))));
}
#[test]
fn atom_boolean() {
assert_eq!(parse_atom("#t"), Ok(("", Expr::Bool(true))));
assert_eq!(parse_atom("#f"), Ok(("", Expr::Bool(false))));
}
#[test]
fn atom_string() {
assert_eq!(
parse_atom("\"hello\""),
Ok(("", Expr::Str("hello".to_string())))
);
}
#[test]
fn atom_symbol() {
assert_eq!(
parse_atom("foo"),
Ok(("", Expr::Symbol("foo".to_string())))
);
assert_eq!(
parse_atom("+"),
Ok(("", Expr::Symbol("+".to_string())))
);
}
#[test]
fn minus_is_symbol_not_integer() {
// A bare "-" with no digits is the symbol "-"
assert_eq!(
parse_atom("- rest"),
Ok((" rest", Expr::Symbol("-".to_string())))
);
}
}
}
The last test is important: it verifies that a bare minus sign parses as the symbol -, rather than failing outright when the integer parser finds no digits after it. This is alt's ordering and backtracking at work.
9. Parsing S-Expressions and Special Forms
S-expressions are parenthesised lists: the heart of Lisp syntax. This section extends the parser to handle arbitrarily nested lists, whitespace between elements, and comments. It then lifts special forms — define, if, lambda, let, begin — out of the generic list parser so they become distinct AST variants, and covers how to handle recursive parsers in nom without running into borrow-checker problems.
Whitespace and comments
Before parsing lists, we need a combinator that skips whitespace and comments:
#![allow(unused)]
fn main() {
use nom::character::complete::{multispace0, not_line_ending};
/// Skip whitespace and comments.
fn ws(input: &str) -> IResult<&str, ()> {
let mut remaining = input;
loop {
// Skip whitespace
let (r, _) = multispace0(remaining)?;
remaining = r;
// If we see a comment, skip it
if remaining.starts_with(';') {
let (r, _) = not_line_ending(remaining)?;
remaining = r;
} else {
break;
}
}
Ok((remaining, ()))
}
}
This loop handles multiple comments and whitespace in any order.
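The same loop can be sketched in plain Rust, without nom, to see exactly what it consumes (`skip_ws` is an illustrative stand-in for the ws combinator above):

```rust
/// Plain-Rust sketch of the ws combinator: skip whitespace and
/// ;-comments (which run to end of line) in any order.
fn skip_ws(mut input: &str) -> &str {
    loop {
        // Skip whitespace.
        input = input.trim_start();
        // If we see a comment, skip to the end of the line; the newline
        // itself is eaten by trim_start on the next pass, mirroring
        // not_line_ending's behaviour.
        if input.starts_with(';') {
            match input.find('\n') {
                Some(i) => input = &input[i..],
                None => input = "",
            }
        } else {
            break;
        }
    }
    input
}

fn main() {
    let src = "  ; first comment\n  ; second comment\n  (+ 1 2)";
    assert_eq!(skip_ws(src), "(+ 1 2)");
    println!("{:?}", skip_ws(src));
}
```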
Parsing a generic list
A list is an opening parenthesis, zero or more whitespace-separated expressions, and a closing parenthesis. The key challenge is that parse_expr (which parses any expression, including lists) must call itself recursively — a list can contain lists.
In nom, recursive parsers require a function boundary. A closure cannot call itself by name, and storing parsers as closure values leads to recursive types the compiler cannot size. Plain fn items have neither problem — any fn can call any other fn, including itself — so we use regular fn items:
#![allow(unused)]
fn main() {
/// Parse any expression: atom or list.
pub fn parse_expr(input: &str) -> IResult<&str, Expr> {
let (input, _) = ws(input)?;
alt((parse_list, parse_atom))(input)
}
/// Parse a parenthesised list of expressions.
fn parse_list(input: &str) -> IResult<&str, Expr> {
let (input, _) = char('(')(input)?;
let (input, _) = ws(input)?;
// Try each special form parser first; if none matches,
// fall back to a generic function call
let result = alt((
parse_define,
parse_if,
parse_lambda,
parse_let,
parse_begin,
parse_call,
))(input);
let (input, expr) = result?;
let (input, _) = ws(input)?;
let (input, _) = char(')')(input)?;
Ok((input, expr))
}
}
Parsing special forms
Each special form parser matches the keyword and then parses the form’s specific structure. These parsers consume everything between the parentheses — the opening and closing parens are handled by parse_list.
define — two forms:
#![allow(unused)]
fn main() {
use nom::sequence::preceded;
/// Parse the inside of a `define` form.
fn parse_define(input: &str) -> IResult<&str, Expr> {
// Require a delimiter after the keyword, so that a symbol such as
// `define-struct` is not silently misparsed as (define -struct ...).
let (input, _) = nom::sequence::terminated(tag("define"), nom::combinator::peek(one_of(" \t\r\n()")))(input)?;
let (input, _) = ws(input)?;
// Try function definition form: (define (name params...) body...)
if input.starts_with('(') {
let (input, _) = char('(')(input)?;
let (input, _) = ws(input)?;
let (input, name) = parse_symbol_name(input)?;
let (input, params) = many0(preceded(ws, parse_symbol_name))(input)?;
let (input, _) = ws(input)?;
let (input, _) = char(')')(input)?;
let (input, body) = many1_expr(input)?;
return Ok((input, Expr::DefineFunc(name, params, body)));
}
// Variable definition form: (define name value)
let (input, name) = parse_symbol_name(input)?;
let (input, _) = ws(input)?;
let (input, value) = parse_expr(input)?;
Ok((input, Expr::Define(name, Box::new(value))))
}
/// Parse a bare symbol name (returns String, not Expr).
fn parse_symbol_name(input: &str) -> IResult<&str, String> {
map(
recognize(pair(
one_of(
"abcdefghijklmnopqrstuvwxyz\
ABCDEFGHIJKLMNOPQRSTUVWXYZ\
_+-*/=<>!?",
),
many0_count(one_of(
"abcdefghijklmnopqrstuvwxyz\
ABCDEFGHIJKLMNOPQRSTUVWXYZ\
0123456789_+-*/=<>!?",
)),
)),
String::from,
)(input)
}
/// Parse one or more expressions (used for body forms).
fn many1_expr(input: &str) -> IResult<&str, Vec<Expr>> {
let (input, first) = preceded(ws, parse_expr)(input)?;
let (input, mut rest) = many0(preceded(ws, parse_expr))(input)?;
rest.insert(0, first);
Ok((input, rest))
}
}
if — exactly three sub-expressions:
#![allow(unused)]
fn main() {
fn parse_if(input: &str) -> IResult<&str, Expr> {
// Require a delimiter: `if` must be a complete token, not the start of `iffy`.
let (input, _) = nom::sequence::terminated(tag("if"), nom::combinator::peek(one_of(" \t\r\n()")))(input)?;
let (input, _) = ws(input)?;
let (input, cond) = parse_expr(input)?;
let (input, _) = ws(input)?;
let (input, then) = parse_expr(input)?;
let (input, _) = ws(input)?;
let (input, els) = parse_expr(input)?;
Ok((
input,
Expr::If(Box::new(cond), Box::new(then), Box::new(els)),
))
}
}
lambda — parameter list and body:
#![allow(unused)]
fn main() {
fn parse_lambda(input: &str) -> IResult<&str, Expr> {
// Require a delimiter so symbols like `lambda-calc` are not misparsed.
let (input, _) = nom::sequence::terminated(tag("lambda"), nom::combinator::peek(one_of(" \t\r\n()")))(input)?;
let (input, _) = ws(input)?;
let (input, _) = char('(')(input)?;
let (input, params) = many0(preceded(ws, parse_symbol_name))(input)?;
let (input, _) = ws(input)?;
let (input, _) = char(')')(input)?;
let (input, body) = many1_expr(input)?;
Ok((input, Expr::Lambda(params, body)))
}
}
let — binding list and body:
#![allow(unused)]
fn main() {
fn parse_let(input: &str) -> IResult<&str, Expr> {
// The delimiter check also keeps a `let*`-style symbol from matching `let`.
let (input, _) = nom::sequence::terminated(tag("let"), nom::combinator::peek(one_of(" \t\r\n()")))(input)?;
let (input, _) = ws(input)?;
let (input, _) = char('(')(input)?;
let (input, bindings) = many0(preceded(ws, parse_binding))(input)?;
let (input, _) = ws(input)?;
let (input, _) = char(')')(input)?;
let (input, body) = many1_expr(input)?;
Ok((input, Expr::Let(bindings, body)))
}
fn parse_binding(input: &str) -> IResult<&str, (String, Expr)> {
let (input, _) = char('(')(input)?;
let (input, _) = ws(input)?;
let (input, name) = parse_symbol_name(input)?;
let (input, _) = ws(input)?;
let (input, val) = parse_expr(input)?;
let (input, _) = ws(input)?;
let (input, _) = char(')')(input)?;
Ok((input, (name, val)))
}
}
begin — one or more expressions:
#![allow(unused)]
fn main() {
fn parse_begin(input: &str) -> IResult<&str, Expr> {
// Require a delimiter so a symbol like `begins` is not misparsed.
let (input, _) = nom::sequence::terminated(tag("begin"), nom::combinator::peek(one_of(" \t\r\n()")))(input)?;
let (input, exprs) = many1_expr(input)?;
Ok((input, Expr::Begin(exprs)))
}
}
Generic function call — anything else:
#![allow(unused)]
fn main() {
fn parse_call(input: &str) -> IResult<&str, Expr> {
let (input, func) = parse_expr(input)?;
let (input, args) = many0(preceded(ws, parse_expr))(input)?;
Ok((input, Expr::Call(Box::new(func), args)))
}
}
Parsing a program
A program is zero or more top-level expressions:
#![allow(unused)]
fn main() {
/// Parse a complete MiniLisp program.
pub fn parse_program(input: &str) -> IResult<&str, Vec<Expr>> {
let (input, _) = ws(input)?;
let (input, exprs) = many0(preceded(ws, parse_expr))(input)?;
let (input, _) = ws(input)?;
Ok((input, exprs))
}
}
Handling recursion and the borrow checker
The recursive call from parse_list → parse_expr → parse_list works because both are plain fn items, not closures. nom processes &str slices, so there are no ownership issues — each parser borrows the input, consumes some of it, and returns the remainder.
If you find yourself reaching for closures to build a recursive parser, the fix is always the same: move the recursive step into a named helper fn. A separate but related tool is nom::combinator::cut, which marks a point of no return — for example, right after a keyword has been consumed — by converting a recoverable error into a failure. That stops alt from backtracking and produces a more precise error message.
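The mechanics can be illustrated with nothing but std: two plain fn items recursing into each other over &str slices, just as parse_expr and parse_list do. This sketch measures the maximum nesting depth of an s-expression (expr_depth and list_depth are illustrative names, not part of the compiler):

```rust
// Minimal recursive-descent sketch: plain fns borrowing &str, mirroring
// the parse_expr <-> parse_list mutual recursion.

/// Parse one "expression": a nested list or a run of atom characters.
/// Returns (rest, depth of the deepest list inside this expression).
fn expr_depth(input: &str) -> Option<(&str, usize)> {
    if input.starts_with('(') {
        list_depth(input)
    } else {
        let end = input
            .find(|c: char| c == '(' || c == ')' || c.is_whitespace())
            .unwrap_or(input.len());
        if end == 0 { None } else { Some((&input[end..], 0)) }
    }
}

/// Parse "(" expr* ")" — calls back into expr_depth, which may call us again.
fn list_depth(input: &str) -> Option<(&str, usize)> {
    let mut rest = input.strip_prefix('(')?;
    let mut deepest = 0;
    loop {
        rest = rest.trim_start();
        if let Some(r) = rest.strip_prefix(')') {
            return Some((r, deepest + 1));
        }
        let (r, d) = expr_depth(rest)?;
        deepest = deepest.max(d);
        rest = r;
    }
}

fn main() {
    let (rest, depth) = expr_depth("(* 3 (+ 1 2))").unwrap();
    assert_eq!((rest, depth), ("", 2));
    println!("depth = {}", depth);
}
```

Each call borrows a sub-slice of the original input and hands the remainder back to its caller — no ownership ever changes hands, which is why the recursion raises no borrow-checker complaints.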
Tests
#![allow(unused)]
fn main() {
#[test]
fn parse_simple_call() {
let (_, expr) = parse_expr("(+ 1 2)").unwrap();
assert_eq!(
expr,
Expr::Call(
Box::new(Expr::Symbol("+".to_string())),
vec![Expr::Int(1), Expr::Int(2)],
)
);
}
#[test]
fn parse_nested() {
let (_, expr) = parse_expr("(* 3 (+ 1 2))").unwrap();
match expr {
Expr::Call(_, args) => assert_eq!(args.len(), 2),
_ => panic!("expected Call"),
}
}
#[test]
fn parse_define_var() {
let (_, expr) = parse_expr("(define x 42)").unwrap();
assert_eq!(
expr,
Expr::Define("x".to_string(), Box::new(Expr::Int(42)))
);
}
#[test]
fn parse_define_func() {
let (_, expr) = parse_expr("(define (f x) x)").unwrap();
match expr {
Expr::DefineFunc(name, params, _) => {
assert_eq!(name, "f");
assert_eq!(params, vec!["x".to_string()]);
}
_ => panic!("expected DefineFunc"),
}
}
#[test]
fn parse_if_expr() {
let (_, expr) = parse_expr("(if #t 1 0)").unwrap();
match expr {
Expr::If(_, _, _) => {} // ok
_ => panic!("expected If"),
}
}
#[test]
fn parse_with_comments() {
let input = "; comment\n(+ 1 2)";
let (_, expr) = parse_expr(input).unwrap();
match expr {
Expr::Call(_, _) => {} // ok
_ => panic!("expected Call"),
}
}
}
Part 3 — Semantic Analysis
10. Symbol Tables and Scope
A symbol table maps names to their definitions. This section walks through a scope-aware traversal of the AST that builds a symbol table, resolves every symbol reference to its definition, and reports helpful errors for undefined names or names used outside their scope. You will implement a simple environment chain — the standard technique for representing nested lexical scopes.
What is a symbol table?
A symbol table answers the question: “when the program says x, what does x refer to?” In MiniLisp, x could be a top-level definition, a function parameter, or a local let binding. The symbol table tracks every name that is in scope at every point in the program.
Environment chains
An environment chain is a stack of scopes. Each scope is a set of names. When you look up a name, you search from the innermost scope outward. If no scope contains the name, it is undefined.
#![allow(unused)]
fn main() {
use std::collections::HashSet;
/// A chain of lexical scopes.
pub struct Env {
scopes: Vec<HashSet<String>>,
}
impl Env {
/// Create a new environment with a single empty global scope.
pub fn new() -> Self {
Env {
scopes: vec![HashSet::new()],
}
}
/// Push a new scope (entering a function, let, etc.).
pub fn push_scope(&mut self) {
self.scopes.push(HashSet::new());
}
/// Pop the innermost scope (leaving a function, let, etc.).
pub fn pop_scope(&mut self) {
self.scopes.pop();
}
/// Define a name in the current (innermost) scope.
pub fn define(&mut self, name: &str) {
if let Some(scope) = self.scopes.last_mut() {
scope.insert(name.to_string());
}
}
/// Check whether a name is defined in any enclosing scope.
pub fn is_defined(&self, name: &str) -> bool {
self.scopes.iter().rev().any(|scope| scope.contains(name))
}
}
}
Why a Vec<HashSet>? Each HashSet is one scope. Pushing and popping is O(1). Lookup walks the vector from back to front, which is O(number of scopes) — for typical programs with a few levels of nesting, this is trivially fast.
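A usage sketch of the chain (repeating the Env type from above, slightly trimmed, so it runs standalone) shows the key property: names defined in an inner scope vanish when the scope is popped, while outer names stay visible throughout:

```rust
use std::collections::HashSet;

// The Env type from this section, repeated so the sketch is self-contained.
struct Env {
    scopes: Vec<HashSet<String>>,
}

impl Env {
    fn new() -> Self {
        Env { scopes: vec![HashSet::new()] }
    }
    fn push_scope(&mut self) {
        self.scopes.push(HashSet::new());
    }
    fn pop_scope(&mut self) {
        self.scopes.pop();
    }
    fn define(&mut self, name: &str) {
        if let Some(scope) = self.scopes.last_mut() {
            scope.insert(name.to_string());
        }
    }
    fn is_defined(&self, name: &str) -> bool {
        self.scopes.iter().rev().any(|scope| scope.contains(name))
    }
}

fn main() {
    let mut env = Env::new();
    env.define("x"); // global scope: (define x ...)
    env.push_scope(); // entering (lambda (y) ...)
    env.define("y");
    assert!(env.is_defined("x")); // outer names remain visible
    assert!(env.is_defined("y"));
    env.pop_scope(); // leaving the lambda
    assert!(env.is_defined("x"));
    assert!(!env.is_defined("y")); // the parameter does not leak out
    println!("env chain ok");
}
```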
Built-in names
The built-in operators (+, -, *, /, =, <, >, <=, >=, not, and, or, display, newline) are always in scope. Define them in the global scope at startup:
#![allow(unused)]
fn main() {
impl Env {
/// Create an environment with built-in names pre-defined.
pub fn with_builtins() -> Self {
let mut env = Env::new();
for name in &[
"+", "-", "*", "/", "=", "<", ">", "<=", ">=",
"not", "and", "or", "display", "newline",
] {
env.define(name);
}
env
}
}
}
Walking the AST
The analyser traverses the AST recursively, maintaining the environment as it goes. For each node:
- Symbol — check that the name is defined in the current environment. If not, record an error.
- Define / DefineFunc — add the name to the current scope, then check the body.
- Lambda — push a new scope, add parameters, check the body, pop the scope.
- Let — evaluate each binding’s value in the outer scope (this is let, not let*), then push a new scope, add all binding names to it, check the body, and pop the scope.
- If, Call, Begin — recursively check all sub-expressions.
#![allow(unused)]
fn main() {
use crate::ast::Expr;
/// Check all expressions for scope errors.
/// Returns a list of error messages (empty means success).
pub fn check_scope(exprs: &[Expr]) -> Vec<String> {
let mut env = Env::with_builtins();
let mut errors = Vec::new();
for expr in exprs {
check_expr(expr, &mut env, &mut errors);
}
errors
}
fn check_expr(expr: &Expr, env: &mut Env, errors: &mut Vec<String>) {
match expr {
Expr::Int(_) | Expr::Bool(_) | Expr::Str(_) => {
// Atoms are always valid
}
Expr::Symbol(name) => {
if !env.is_defined(name) {
errors.push(format!("undefined symbol: {}", name));
}
}
Expr::Call(func, args) => {
check_expr(func, env, errors);
for arg in args {
check_expr(arg, env, errors);
}
}
Expr::Define(name, value) => {
check_expr(value, env, errors);
env.define(name);
}
Expr::DefineFunc(name, params, body) => {
env.define(name); // allow recursion
env.push_scope();
for p in params {
env.define(p);
}
for expr in body {
check_expr(expr, env, errors);
}
env.pop_scope();
}
Expr::If(cond, then, els) => {
check_expr(cond, env, errors);
check_expr(then, env, errors);
check_expr(els, env, errors);
}
Expr::Lambda(params, body) => {
env.push_scope();
for p in params {
env.define(p);
}
for expr in body {
check_expr(expr, env, errors);
}
env.pop_scope();
}
Expr::Let(bindings, body) => {
// Evaluate binding values in the outer scope
for (_, val) in bindings {
check_expr(val, env, errors);
}
// Then introduce binding names in a new scope
env.push_scope();
for (name, _) in bindings {
env.define(name);
}
for expr in body {
check_expr(expr, env, errors);
}
env.pop_scope();
}
Expr::Begin(exprs) => {
for expr in exprs {
check_expr(expr, env, errors);
}
}
}
}
}
Error reporting
Notice that DefineFunc defines the function name before checking the body. This allows recursive functions to refer to themselves. Without this, (define (fact n) ... (fact (- n 1))) would report fact as undefined inside its own body.
The analyser collects all errors before stopping, so the programmer sees every problem at once:
undefined symbol: foo
undefined symbol: bar
Tests
#![allow(unused)]
fn main() {
#[test]
fn undefined_symbol() {
let exprs = vec![Expr::Symbol("x".to_string())];
let errors = check_scope(&exprs);
assert_eq!(errors, vec!["undefined symbol: x"]);
}
#[test]
fn defined_symbol() {
let exprs = vec![
Expr::Define("x".to_string(), Box::new(Expr::Int(1))),
Expr::Symbol("x".to_string()),
];
let errors = check_scope(&exprs);
assert!(errors.is_empty());
}
#[test]
fn builtin_is_defined() {
let exprs = vec![Expr::Symbol("+".to_string())];
let errors = check_scope(&exprs);
assert!(errors.is_empty());
}
#[test]
fn lambda_scope() {
// (lambda (x) x) — x is defined inside the lambda
let exprs = vec![Expr::Lambda(
vec!["x".to_string()],
vec![Expr::Symbol("x".to_string())],
)];
let errors = check_scope(&exprs);
assert!(errors.is_empty());
}
#[test]
fn lambda_scope_does_not_leak() {
// (lambda (x) x) then reference x — x should be undefined
let exprs = vec![
Expr::Lambda(
vec!["x".to_string()],
vec![Expr::Symbol("x".to_string())],
),
Expr::Symbol("x".to_string()),
];
let errors = check_scope(&exprs);
assert_eq!(errors, vec!["undefined symbol: x"]);
}
}
11. Checking Special Forms
Special forms have fixed shapes: if needs exactly three sub-expressions; define needs a name and a body; lambda needs a parameter list and at least one body expression. This section adds arity and shape checks for each special form so that malformed programs produce clear error messages rather than mysterious C output.
Why check form shapes?
The parser already enforces some structure — it will not produce an If node with two sub-expressions because the parser expects exactly three. However, the parser’s error messages for malformed input can be cryptic (nom reports parsing failures, not semantic problems). The analyser provides a second line of defence with clear, domain-specific error messages.
More importantly, if the AST is ever constructed programmatically (for example, by a macro expander), the analyser catches structural errors that the parser would have prevented.
Checks to implement
| Form | Rule | Error message |
|---|---|---|
| Define | must have a value | (enforced by AST structure) |
| DefineFunc | must have at least one body expression | define: function '&lt;name&gt;' body is empty |
| If | must have exactly condition, then, else | (enforced by AST structure) |
| Lambda | must have at least one body expression | lambda: body is empty |
| Let | must have at least one body expression | let: body is empty |
| Let | each binding must have exactly a name and a value | (enforced by AST structure) |
| Begin | must have at least one expression | begin: empty begin block |
Several of these are enforced by the AST’s type structure — for example, If always has exactly three Box<Expr> fields, so you cannot construct a two-armed If. The checks below handle the cases that the type system does not catch.
Implementation
Add form validation to src/analysis.rs, integrated with the scope checker:
#![allow(unused)]
fn main() {
/// Validate the shape of special forms.
fn check_forms(expr: &Expr, errors: &mut Vec<String>) {
match expr {
Expr::DefineFunc(name, _, body) => {
if body.is_empty() {
errors.push(format!(
"define: function '{}' body is empty", name
));
}
for e in body {
check_forms(e, errors);
}
}
Expr::Lambda(_, body) => {
if body.is_empty() {
errors.push("lambda: body is empty".to_string());
}
for e in body {
check_forms(e, errors);
}
}
Expr::Let(bindings, body) => {
if body.is_empty() {
errors.push("let: body is empty".to_string());
}
for (_, val) in bindings {
check_forms(val, errors);
}
for e in body {
check_forms(e, errors);
}
}
Expr::Begin(exprs) => {
if exprs.is_empty() {
errors.push("begin: empty begin block".to_string());
}
for e in exprs {
check_forms(e, errors);
}
}
Expr::Define(_, val) => {
check_forms(val, errors);
}
Expr::If(cond, then, els) => {
check_forms(cond, errors);
check_forms(then, errors);
check_forms(els, errors);
}
Expr::Call(func, args) => {
check_forms(func, errors);
for a in args {
check_forms(a, errors);
}
}
Expr::Int(_) | Expr::Bool(_) | Expr::Str(_) | Expr::Symbol(_) => {}
}
}
}
Combining scope and form checks
The public analyse function runs both passes:
#![allow(unused)]
fn main() {
/// Run all semantic checks on a parsed program.
pub fn analyse(exprs: &[Expr]) -> Result<(), Vec<String>> {
let mut errors = Vec::new();
// Pass 1: scope checking
errors.extend(check_scope(exprs));
// Pass 2: form validation
for expr in exprs {
check_forms(expr, &mut errors);
}
if errors.is_empty() {
Ok(())
} else {
Err(errors)
}
}
}
Tests
#![allow(unused)]
fn main() {
#[test]
fn empty_begin() {
let exprs = vec![Expr::Begin(vec![])];
let result = analyse(&exprs);
assert!(result.is_err());
let errs = result.unwrap_err();
assert!(errs.iter().any(|e| e.contains("empty begin block")));
}
#[test]
fn empty_lambda_body() {
let exprs = vec![Expr::Lambda(vec!["x".to_string()], vec![])];
let result = analyse(&exprs);
assert!(result.is_err());
let errs = result.unwrap_err();
assert!(errs.iter().any(|e| e.contains("lambda: body is empty")));
}
#[test]
fn valid_program() {
let exprs = vec![
Expr::Define("x".to_string(), Box::new(Expr::Int(10))),
Expr::Call(
Box::new(Expr::Symbol("display".to_string())),
vec![Expr::Symbol("x".to_string())],
),
];
assert!(analyse(&exprs).is_ok());
}
}
Part 4 — Code Generation
12. The C Runtime Preamble
Every MiniLisp program compiles to a C file that begins with a standard preamble: #include directives, type aliases, boolean constants, and thin wrappers for built-in operations like display and newline. This section designs the preamble, explains why each piece is there, and shows how the code generator emits it before any user-defined code.
Why a preamble?
MiniLisp has built-in operations — display, newline, arithmetic, comparisons — that the generated C code calls. These need C implementations. Rather than linking a separate runtime library, we emit the implementations directly at the top of the C file. This makes every generated file self-contained: you compile it with cc -o out out.c and it works.
The complete preamble
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
/* Boolean type */
typedef int bool_t;
#define TRUE 1
#define FALSE 0
/* Display functions */
static void display_int(long long x) {
printf("%lld", x);
}
static void display_bool(bool_t x) {
printf("%s", x ? "#t" : "#f");
}
static void display_str(const char *s) {
printf("%s", s);
}
static void newline_() {
printf("\n");
}
Line by line:
- #include <stdio.h> — for printf.
- #include <stdlib.h> — for exit, used by error-handling code.
- #include <string.h> — for any string operations.
- typedef int bool_t — MiniLisp booleans become C integers. TRUE and FALSE are macros for readability.
- display_int, display_bool, display_str — MiniLisp’s display is polymorphic (it can print any value). Since our compiler knows the type of each expression at compile time (integers, booleans, and strings are the only printable types), it calls the correct display_* variant directly.
- newline_ — the trailing underscore avoids clashing with a newline that may already exist in the C environment. This is a pattern we use throughout: MiniLisp names are mangled to avoid conflicts with C keywords and the standard library.
Emitting the preamble
In src/codegen.rs, define the preamble as a constant:
#![allow(unused)]
fn main() {
const PREAMBLE: &str = r#"#include <stdio.h>
#include <stdlib.h>
#include <string.h>
typedef int bool_t;
#define TRUE 1
#define FALSE 0
static void display_int(long long x) {
printf("%lld", x);
}
static void display_bool(bool_t x) {
printf("%s", x ? "#t" : "#f");
}
static void display_str(const char *s) {
printf("%s", s);
}
static void newline_() {
printf("\n");
}
"#;
}
The code generator begins every output file with this string. User-defined code follows immediately after.
Design decision: monomorphic display
A more sophisticated compiler would use a tagged union (a “value” type) so that display could handle any value at runtime. We avoid this because:
- It adds significant complexity (you need a value representation, tag checks, and memory management).
- Our compiler knows at the AST level whether an expression is an integer, boolean, or string, so it can emit the correct display_* call directly.
- The goal is simplicity and clarity — you can extend this later if you want dynamic typing.
13. Generating C: Atoms and Expressions
This section implements the expression code generator — the recursive function that turns an Expr into a C expression string. Integers become C integer literals; booleans become TRUE and FALSE; strings become string literals; arithmetic and comparison operations become C operators; function calls become C function-call syntax. You will also handle name-mangling: turning Lisp symbols like my-var into valid C identifiers.
Name mangling
MiniLisp symbols can contain characters that are not valid in C identifiers: -, +, *, /, =, <, >, !, ?. We replace each with a spelled-out substitute:
#![allow(unused)]
fn main() {
/// Convert a MiniLisp symbol into a valid C identifier.
fn mangle(name: &str) -> String {
let mut out = String::new();
for ch in name.chars() {
match ch {
'-' => out.push_str("_dash_"),
'+' => out.push_str("_plus_"),
'*' => out.push_str("_star_"),
'/' => out.push_str("_slash_"),
'=' => out.push_str("_eq_"),
'<' => out.push_str("_lt_"),
'>' => out.push_str("_gt_"),
'!' => out.push_str("_bang_"),
'?' => out.push_str("_qmark_"),
c => out.push(c),
}
}
out
}
}
Examples:
| MiniLisp | C identifier |
|---|---|
| x | x |
| my-var | my_dash_var |
| <= | _lt__eq_ |
| zero? | zero_qmark_ |
Generating atoms
#![allow(unused)]
fn main() {
/// Generate a C expression from an AST node.
fn gen_expr(expr: &Expr) -> String {
match expr {
Expr::Int(n) => format!("{}LL", n),
Expr::Bool(b) => if *b { "TRUE".to_string() } else { "FALSE".to_string() },
Expr::Str(s) => format!("\"{}\"", s.replace('\\', "\\\\")
.replace('"', "\\\"")
.replace('\n', "\\n")
.replace('\t', "\\t")),
Expr::Symbol(name) => mangle(name),
// ... other variants handled below
_ => todo!(),
}
}
}
Details:
- Integers get the LL suffix to ensure they are long long in C, matching our display_int signature.
- Booleans become the macros TRUE and FALSE from the preamble.
- Strings are re-escaped for C. The MiniLisp parser already resolved escape sequences into actual characters, so we must re-escape them for the C source.
- Symbols are mangled into valid C identifiers.
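The string round trip — the parser decodes escapes, the generator re-encodes them — can be sketched in isolation. escape_c below repeats the replace chain from gen_expr's Str arm (the function name is illustrative):

```rust
/// Re-escape a decoded MiniLisp string for inclusion in C source,
/// using the same replace chain as gen_expr's Str arm.
fn escape_c(s: &str) -> String {
    s.replace('\\', "\\\\") // backslashes first, so later escapes survive
        .replace('"', "\\\"")
        .replace('\n', "\\n")
        .replace('\t', "\\t")
}

fn main() {
    // The parser turned the source text "a\"b\nc" into these real characters:
    let decoded = "a\"b\nc";
    // The generator must emit them back as C escape sequences:
    assert_eq!(escape_c(decoded), "a\\\"b\\nc");
    println!("{}", escape_c(decoded));
}
```

The ordering matters: if backslashes were escaped last, the backslash produced for the quote would itself get doubled.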
Generating arithmetic and comparisons
Built-in binary operators compile to infix C operators:
#![allow(unused)]
fn main() {
Expr::Call(func, args) => {
if let Expr::Symbol(name) = func.as_ref() {
match name.as_str() {
// Arithmetic
"+" => format!("({} + {})", gen_expr(&args[0]), gen_expr(&args[1])),
"-" => format!("({} - {})", gen_expr(&args[0]), gen_expr(&args[1])),
"*" => format!("({} * {})", gen_expr(&args[0]), gen_expr(&args[1])),
"/" => format!("({} / {})", gen_expr(&args[0]), gen_expr(&args[1])),
// Comparisons
"=" => format!("({} == {})", gen_expr(&args[0]), gen_expr(&args[1])),
"<" => format!("({} < {})", gen_expr(&args[0]), gen_expr(&args[1])),
">" => format!("({} > {})", gen_expr(&args[0]), gen_expr(&args[1])),
"<=" => format!("({} <= {})", gen_expr(&args[0]), gen_expr(&args[1])),
">=" => format!("({} >= {})", gen_expr(&args[0]), gen_expr(&args[1])),
// Boolean operators
"not" => format!("(!{})", gen_expr(&args[0])),
"and" => format!("({} && {})", gen_expr(&args[0]), gen_expr(&args[1])),
"or" => format!("({} || {})", gen_expr(&args[0]), gen_expr(&args[1])),
// Display
"display" => gen_display(&args[0]),
"newline" => "newline_()".to_string(),
// User-defined function call
_ => {
let mangled = mangle(name);
let arg_strs: Vec<String> = args.iter().map(gen_expr).collect();
format!("{}({})", mangled, arg_strs.join(", "))
}
}
} else {
// Calling a non-symbol expression (e.g., a lambda)
let func_str = gen_expr(func);
let arg_strs: Vec<String> = args.iter().map(gen_expr).collect();
format!("{}({})", func_str, arg_strs.join(", "))
}
}
}
Each arithmetic and comparison operation is wrapped in parentheses to preserve correct precedence in the C output.
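Why the parentheses matter can be shown with a tiny two-variant generator — a sketch assuming only integers and binary operations (E and gen are illustrative names):

```rust
// Minimal sketch of gen_expr for integers and binary operators,
// showing why every operation is wrapped in parentheses.
enum E {
    Int(i64),
    Bin(&'static str, Box<E>, Box<E>),
}

fn gen(e: &E) -> String {
    match e {
        E::Int(n) => format!("{}LL", n),
        E::Bin(op, a, b) => format!("({} {} {})", gen(a), op, gen(b)),
    }
}

fn main() {
    // (* (+ 1 2) 3) — without the parens this would emit 1LL + 2LL * 3LL,
    // which C parses as 1 + (2 * 3) under its precedence rules.
    let e = E::Bin(
        "*",
        Box::new(E::Bin("+", Box::new(E::Int(1)), Box::new(E::Int(2)))),
        Box::new(E::Int(3)),
    );
    assert_eq!(gen(&e), "((1LL + 2LL) * 3LL)");
    println!("{}", gen(&e));
}
```

Fully parenthesizing every operation is slightly noisy to read, but it means the generator never has to reason about C operator precedence at all.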
Generating display calls
display in MiniLisp is polymorphic, but our compiler knows the type statically. We dispatch based on the argument’s AST type:
#![allow(unused)]
fn main() {
fn gen_display(expr: &Expr) -> String {
match expr {
Expr::Int(_) => format!("display_int({})", gen_expr(expr)),
Expr::Bool(_) => format!("display_bool({})", gen_expr(expr)),
Expr::Str(_) => format!("display_str({})", gen_expr(expr)),
// For expressions whose type we can't determine statically,
// default to display_int (integers are the most common case)
_ => format!("display_int({})", gen_expr(expr)),
}
}
}
This is a simplification. A production compiler would either track types through the AST or use a tagged union. For our purposes, defaulting to display_int for computed expressions (like (display (+ 1 2))) is correct because all our arithmetic produces integers.
Tests
#![allow(unused)]
fn main() {
#[test]
fn gen_integer() {
assert_eq!(gen_expr(&Expr::Int(42)), "42LL");
}
#[test]
fn gen_bool() {
assert_eq!(gen_expr(&Expr::Bool(true)), "TRUE");
}
#[test]
fn gen_addition() {
let expr = Expr::Call(
Box::new(Expr::Symbol("+".to_string())),
vec![Expr::Int(1), Expr::Int(2)],
);
assert_eq!(gen_expr(&expr), "(1LL + 2LL)");
}
#[test]
fn mangle_hyphen() {
assert_eq!(mangle("my-var"), "my_dash_var");
}
}
14. Generating C: Definitions and Functions
Top-level define forms and lambda expressions compile to C function and variable declarations. This section covers how to emit forward declarations (so mutual recursion works), how to turn a MiniLisp parameter list into a C function signature, how lambda compiles to a named C function, and how top-level definitions are ordered in the output file.
The output file structure
The generated C file has four sections, in this order:
/* 1. Preamble (§12) */
#include <stdio.h>
// ...
/* 2. Forward declarations */
long long square(long long x);
long long fact(long long n);
/* 3. Function definitions */
long long square(long long x) {
return (x * x);
}
long long fact(long long n) {
return (n == 0LL) ? 1LL : (n * fact((n - 1LL)));
}
/* 4. main() — top-level expressions */
int main() {
display_int(square(5LL));
newline_();
return 0;
}
Forward declarations allow functions to call each other regardless of definition order. This mirrors how MiniLisp works — order of definitions does not matter for function calls.
Generating function definitions
A DefineFunc compiles to a C function. All parameters and return values use long long for simplicity (MiniLisp does not have a type system, and integers are the primary data type):
#![allow(unused)]
fn main() {
fn gen_func_decl(name: &str, params: &[String]) -> String {
let mangled_name = mangle(name);
let param_list: Vec<String> = params
.iter()
.map(|p| format!("long long {}", mangle(p)))
.collect();
format!("long long {}({})", mangled_name, param_list.join(", "))
}
fn gen_func_def(name: &str, params: &[String], body: &[Expr]) -> String {
let decl = gen_func_decl(name, params);
let body_strs: Vec<String> = body.iter().map(gen_expr).collect();
// All body expressions except the last are statements;
// the last is the return value
let mut lines = Vec::new();
for (i, s) in body_strs.iter().enumerate() {
if i == body_strs.len() - 1 {
lines.push(format!(" return {};", s));
} else {
lines.push(format!(" {};", s));
}
}
format!("{} {{\n{}\n}}", decl, lines.join("\n"))
}
}
For example, (define (square x) (* x x)) becomes:
long long square(long long x) {
return (x * x);
}
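Signature generation composes with name mangling; a self-contained sketch (mangle here is the §13 function trimmed to the two characters this example exercises):

```rust
/// The mangle function from the previous section, trimmed to the
/// characters exercised below.
fn mangle(name: &str) -> String {
    name.chars()
        .map(|ch| match ch {
            '-' => "_dash_".to_string(),
            '?' => "_qmark_".to_string(),
            c => c.to_string(),
        })
        .collect()
}

/// Build a C function signature: every parameter and the return
/// value are long long.
fn gen_func_decl(name: &str, params: &[String]) -> String {
    let param_list: Vec<String> = params
        .iter()
        .map(|p| format!("long long {}", mangle(p)))
        .collect();
    format!("long long {}({})", mangle(name), param_list.join(", "))
}

fn main() {
    let decl = gen_func_decl("zero?", &["my-arg".to_string()]);
    assert_eq!(decl, "long long zero_qmark_(long long my_dash_arg)");
    println!("{};", decl); // with a trailing ';' this is the forward declaration
}
```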
Generating variable definitions
A Define (variable, not function) compiles to a C variable declaration. Note that C requires global initializers to be constant expressions, and a MiniLisp value such as (define y (square 5)) may call a function — so generate() below emits these declarations inside main rather than at file scope:
#![allow(unused)]
fn main() {
fn gen_var_def(name: &str, value: &Expr) -> String {
format!("long long {} = {};", mangle(name), gen_expr(value))
}
}
For example, (define pi 3) becomes:
long long pi = 3LL;
Lambda lifting
A lambda expression needs a name in C — C does not have anonymous functions. We generate a unique name for each lambda:
#![allow(unused)]
fn main() {
use std::sync::atomic::{AtomicUsize, Ordering};
static LAMBDA_COUNTER: AtomicUsize = AtomicUsize::new(0);
fn fresh_lambda_name() -> String {
let n = LAMBDA_COUNTER.fetch_add(1, Ordering::SeqCst);
format!("__lambda_{}", n)
}
}
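A quick standalone check of the counter (a self-contained copy of the function above): because fetch_add returns the previous value and then increments, successive calls always produce distinct names.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

static LAMBDA_COUNTER: AtomicUsize = AtomicUsize::new(0);

fn fresh_lambda_name() -> String {
    // fetch_add returns the old value, so each call sees a unique number
    let n = LAMBDA_COUNTER.fetch_add(1, Ordering::SeqCst);
    format!("__lambda_{}", n)
}

fn main() {
    // In a fresh process the counter starts at zero.
    let a = fresh_lambda_name();
    let b = fresh_lambda_name();
    assert_eq!(a, "__lambda_0");
    assert_eq!(b, "__lambda_1");
    assert_ne!(a, b);
}
```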
When the code generator encounters a Lambda, it:
- Generates a named function definition using the fresh name.
- Adds it to a list of “lifted” functions that will be emitted before main.
- Returns the function’s name as the C expression.
#![allow(unused)]
fn main() {
fn gen_lambda(
params: &[String],
body: &[Expr],
lifted: &mut Vec<String>,
) -> String {
let name = fresh_lambda_name();
let def = gen_func_def(&name, params, body);
lifted.push(def);
name
}
}
Putting function generation together
The top-level generation function separates definitions from expressions:
#![allow(unused)]
fn main() {
pub fn generate(exprs: &[Expr]) -> String {
let mut forward_decls = Vec::new();
let mut func_defs = Vec::new();
let mut main_stmts = Vec::new();
// Populated by gen_lambda when lambdas are lifted; threading this
// through gen_expr is omitted here for brevity.
let mut lifted = Vec::new();
for expr in exprs {
match expr {
Expr::DefineFunc(name, params, body) => {
forward_decls.push(format!("{};", gen_func_decl(name, params)));
func_defs.push(gen_func_def(name, params, body));
}
Expr::Define(name, value) => {
main_stmts.push(format!(
"long long {} = {};",
mangle(name),
gen_expr(value)
));
}
_ => {
main_stmts.push(format!("{};", gen_expr(expr)));
}
}
}
let mut output = String::new();
output.push_str(PREAMBLE);
for decl in &forward_decls {
output.push_str(decl);
output.push('\n');
}
output.push('\n');
for def in &func_defs {
output.push_str(def);
output.push_str("\n\n");
}
for lf in &lifted {
output.push_str(lf);
output.push_str("\n\n");
}
output.push_str("int main() {\n");
for stmt in &main_stmts {
output.push_str(" ");
output.push_str(stmt);
output.push('\n');
}
output.push_str(" return 0;\n");
output.push_str("}\n");
output
}
}
15. Generating C: Control Flow and Sequencing
if, begin, and let each require their own code-generation strategy. if becomes a C ternary expression, which keeps it usable in expression position; begin becomes a sequence of C statements with the last value forwarded; let introduces a C block with local variable declarations. This section works through each form and resolves the practical question of when to emit expressions versus statements.
The expression vs. statement problem
C distinguishes expressions (produce a value) from statements (do not). MiniLisp does not — everything is an expression. This creates a tension: sometimes a MiniLisp form appears where a C expression is needed (e.g., as a function argument), and sometimes it appears where a C statement is fine (e.g., at the top level).
Our strategy: generate expressions wherever possible, falling back to statement blocks only when necessary. The ternary operator ? : lets us keep if as an expression. begin and let use GCC’s “statement expressions” extension ({ ... }) to remain in expression position.
Generating if
An if compiles to a C ternary expression:
#![allow(unused)]
fn main() {
Expr::If(cond, then, els) => {
format!(
"({} ? {} : {})",
gen_expr(cond),
gen_expr(then),
gen_expr(els),
)
}
}
MiniLisp (if (> x 0) x (- 0 x)) becomes C ((x > 0LL) ? x : (0LL - x)).
The ternary operator works because our if always has both branches. If we supported if without an else branch, we would need a different strategy.
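To see the arm in isolation, here is a self-contained sketch with a stripped-down Expr (Int and If only; the real Expr has more variants). Nested ifs work automatically because each arm recurses:

```rust
// Stripped-down Expr (Int and If only) to show the ternary
// translation in isolation.
enum Expr {
    Int(i64),
    If(Box<Expr>, Box<Expr>, Box<Expr>),
}

fn gen_expr(e: &Expr) -> String {
    match e {
        Expr::Int(n) => format!("{}LL", n),
        Expr::If(c, t, f) => {
            format!("({} ? {} : {})", gen_expr(c), gen_expr(t), gen_expr(f))
        }
    }
}

fn main() {
    // (if 1 2 3) in MiniLisp terms
    let e = Expr::If(
        Box::new(Expr::Int(1)),
        Box::new(Expr::Int(2)),
        Box::new(Expr::Int(3)),
    );
    assert_eq!(gen_expr(&e), "(1LL ? 2LL : 3LL)");
}
```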
Generating begin
begin evaluates a sequence of expressions and returns the value of the last one. We use GCC’s statement-expression extension:
#![allow(unused)]
fn main() {
Expr::Begin(exprs) => {
// Every statement keeps its trailing semicolon: in a GNU C statement
// expression the value is that of the last expression statement, and
// ({ ... 42LL }) without the final semicolon would not parse.
let parts: Vec<String> = exprs
.iter()
.map(|e| format!("{};", gen_expr(e)))
.collect();
format!("({{ {} }})", parts.join(" "))
}
}
MiniLisp:
(begin
(display 1)
(newline)
42)
Becomes C:
({ display_int(1LL); newline_(); 42LL; })
The statement-expression ({ ... }) evaluates each statement in order and produces the value of the last expression. This is a GCC/Clang extension, not standard C. For portability, you could instead generate a helper function, but the extension keeps things simple.
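The string assembly can be sketched standalone; gen_begin here is a hypothetical helper that takes already-generated parts. Note that every part, including the last, keeps its semicolon, matching the output shown above:

```rust
// Assemble a begin statement expression from already-generated parts.
// Every part keeps its semicolon; the value of the statement expression
// is that of the last expression statement.
fn gen_begin(parts: &[&str]) -> String {
    let stmts: Vec<String> = parts.iter().map(|s| format!("{};", s)).collect();
    format!("({{ {} }})", stmts.join(" "))
}

fn main() {
    let out = gen_begin(&["display_int(1LL)", "newline_()", "42LL"]);
    assert_eq!(out, "({ display_int(1LL); newline_(); 42LL; })");
}
```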
Generating let
let introduces local variables. We generate a statement-expression with local declarations:
#![allow(unused)]
fn main() {
Expr::Let(bindings, body) => {
let mut parts = Vec::new();
for (name, val) in bindings {
parts.push(format!(
"long long {} = {};",
mangle(name),
gen_expr(val),
));
}
// As with begin, the last expression also keeps its semicolon
// so the statement expression is valid GNU C.
for e in body {
parts.push(format!("{};", gen_expr(e)));
}
format!("({{ {} }})", parts.join(" "))
}
}
MiniLisp:
(let ((x 10)
(y 20))
(+ x y))
Becomes C:
({ long long x = 10LL; long long y = 20LL; (x + y); })
Complete gen_expr
Here is the full gen_expr function with all variants:
#![allow(unused)]
fn main() {
fn gen_expr(expr: &Expr) -> String {
match expr {
Expr::Int(n) => format!("{}LL", n),
Expr::Bool(b) => if *b { "TRUE".to_string() } else { "FALSE".to_string() },
Expr::Str(s) => {
let escaped = s.replace('\\', "\\\\")
.replace('"', "\\\"")
.replace('\n', "\\n")
.replace('\t', "\\t");
format!("\"{}\"", escaped)
}
Expr::Symbol(name) => mangle(name),
Expr::Call(func, args) => {
// ... (as shown in §13)
gen_call(func, args)
}
Expr::If(cond, then, els) => {
format!("({} ? {} : {})",
gen_expr(cond), gen_expr(then), gen_expr(els))
}
Expr::Begin(exprs) => {
// Every statement, including the last, keeps its semicolon
// (required for a valid GNU C statement expression).
let parts: Vec<String> = exprs
.iter()
.map(|e| format!("{};", gen_expr(e)))
.collect();
format!("({{ {} }})", parts.join(" "))
}
Expr::Let(bindings, body) => {
let mut parts = Vec::new();
for (name, val) in bindings {
parts.push(format!("long long {} = {};",
mangle(name), gen_expr(val)));
}
for e in body {
parts.push(format!("{};", gen_expr(e)));
}
format!("({{ {} }})", parts.join(" "))
}
Expr::Lambda(params, body) => {
// Lambda lifting is handled at the top level; a lambda
// reaching expression position would need to be lifted
// via gen_lambda. For simplicity we emit a placeholder
// comment here; note the output will not compile if the
// lambda's value is actually used.
let name = fresh_lambda_name();
format!("/* lambda {} */", name)
}
Expr::Define(name, val) => {
format!("long long {} = {}", mangle(name), gen_expr(val))
}
Expr::DefineFunc(_, _, _) => {
// Function definitions are handled at the top level
String::new()
}
}
}
}
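One subtlety in the Str arm above: the backslash replacement must run first, because the later replacements introduce new backslashes that must not themselves be escaped. A standalone check:

```rust
// Escape a string for embedding in C source. The backslash replacement
// must come first; the later replacements insert backslashes that would
// otherwise be doubled.
fn escape(s: &str) -> String {
    s.replace('\\', "\\\\")
        .replace('"', "\\\"")
        .replace('\n', "\\n")
        .replace('\t', "\\t")
}

fn main() {
    // A literal backslash followed by 'n' becomes two backslashes + 'n'.
    assert_eq!(escape("\\n"), "\\\\n");
    // A real newline becomes \n; a double quote becomes \".
    assert_eq!(escape("a\"b\n"), "a\\\"b\\n");
}
```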
Tests
#![allow(unused)]
fn main() {
#[test]
fn gen_if() {
let expr = Expr::If(
Box::new(Expr::Bool(true)),
Box::new(Expr::Int(1)),
Box::new(Expr::Int(0)),
);
assert_eq!(gen_expr(&expr), "(TRUE ? 1LL : 0LL)");
}
#[test]
fn gen_let() {
let expr = Expr::Let(
vec![("x".to_string(), Expr::Int(10))],
vec![Expr::Symbol("x".to_string())],
);
assert_eq!(gen_expr(&expr), "({ long long x = 10LL; x; })");
}
#[test]
fn gen_begin() {
let expr = Expr::Begin(vec![Expr::Int(1), Expr::Int(2)]);
assert_eq!(gen_expr(&expr), "({ 1LL; 2LL; })");
}
}
Part 5 — Putting It Together
16. The Compilation Pipeline
With all stages implemented, this section wires them into a single compile function and builds a CLI entry point that reads MiniLisp from a file or stdin and writes C to stdout or a file. You will add basic error reporting that shows the source location of each failure and trace a complete example — a recursive factorial function — through every stage.
The compile function
#![allow(unused)]
fn main() {
// In src/main.rs
mod ast;
mod parser;
mod analysis;
mod codegen;
use std::io::{self, Read};
use std::process;
fn compile(source: &str) -> Result<String, Vec<String>> {
// Stage 1: Parse
let exprs = match parser::parse_program(source) {
Ok(("", exprs)) => exprs,
Ok((remaining, _)) => {
return Err(vec![format!(
"parse error: unexpected trailing input: {}",
&remaining[..remaining.len().min(40)]
)]);
}
Err(e) => {
return Err(vec![format!("parse error: {}", e)]);
}
};
// Stage 2: Analyse
analysis::analyse(&exprs)?;
// Stage 3: Generate
Ok(codegen::generate(&exprs))
}
}
The function takes MiniLisp source as a string and returns either the generated C code or a list of error messages.
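One caveat in the parse-error path: &remaining[..remaining.len().min(40)] slices by byte index and will panic if byte 40 falls inside a multi-byte UTF-8 character. A character-based truncation (a sketch; truncate_chars is a hypothetical helper, not part of the compiler above) avoids this:

```rust
// Byte slicing like &s[..40] panics if index 40 is not a char boundary;
// taking chars instead is safe for arbitrary (e.g. non-ASCII) input.
fn truncate_chars(s: &str, n: usize) -> String {
    s.chars().take(n).collect()
}

fn main() {
    // "héllo" is 6 bytes but 5 chars; char-based truncation never splits é.
    assert_eq!(truncate_chars("héllo", 3), "hél");
    // Asking for more chars than exist just returns the whole string.
    assert_eq!(truncate_chars("ab", 40), "ab");
}
```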
The CLI entry point
fn main() {
let mut input = String::new();
io::stdin()
.read_to_string(&mut input)
.expect("failed to read stdin");
match compile(&input) {
Ok(c_code) => print!("{}", c_code),
Err(errors) => {
for err in &errors {
eprintln!("error: {}", err);
}
process::exit(1);
}
}
}
Usage:
# Compile MiniLisp to C
echo '(display 42) (newline)' | cargo run > out.c
# Compile and run the C
cc -o out out.c && ./out
42
Tracing an example: factorial
Let us trace (define (fact n) (if (= n 0) 1 (* n (fact (- n 1))))) (display (fact 10)) (newline) through every stage.
Stage 1: Parse
The parser produces:
#![allow(unused)]
fn main() {
[
DefineFunc("fact", ["n"], [
If(
Call(Symbol("="), [Symbol("n"), Int(0)]),
Int(1),
Call(Symbol("*"), [
Symbol("n"),
Call(Symbol("fact"), [
Call(Symbol("-"), [Symbol("n"), Int(1)])
])
])
)
]),
Call(Symbol("display"), [
Call(Symbol("fact"), [Int(10)])
]),
Call(Symbol("newline"), []),
]
}
Stage 2: Analyse
The analyser checks:
- fact is defined (by DefineFunc), so the recursive call is valid.
- n is a parameter of fact, so it is in scope inside the body.
- =, *, -, display, and newline are built-in and always in scope.
- The if has three sub-expressions. All forms are well-shaped.
Result: no errors.
Stage 3: Generate
The code generator produces:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
typedef int bool_t;
#define TRUE 1
#define FALSE 0
// ... (display functions, etc.)
long long fact(long long n);
long long fact(long long n) {
return ((n == 0LL) ? 1LL : (n * fact((n - 1LL))));
}
int main() {
display_int(fact(10LL));
newline_();
return 0;
}
Compile and run:
$ cc -o fact out.c && ./fact
3628800
Error reporting
When the compiler encounters errors, it prints them to stderr and exits with code 1:
$ echo '(display x)' | cargo run
error: undefined symbol: x
$ echo '(if #t 1)' | cargo run
error: parse error: ...
The parser error will mention the position where parsing failed. The semantic analyser’s errors identify the problematic symbol or form by name.
17. Testing the Compiler
Good tests are what turn a working prototype into a reliable tool. This section adds unit tests for each compiler stage and integration tests that compile MiniLisp programs, feed the C output to cc, run the binary, and assert on stdout. You will build a small test corpus of MiniLisp programs covering all language features and ensure the compiler handles both valid and invalid input gracefully.
Unit test structure
Each module has inline unit tests in a #[cfg(test)] module. We have already seen examples throughout the course. Here is a summary of what to test in each module:
ast.rs — test Display:
#![allow(unused)]
fn main() {
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn display_int() {
assert_eq!(format!("{}", Expr::Int(42)), "42");
}
#[test]
fn display_call() {
let expr = Expr::Call(
Box::new(Expr::Symbol("+".to_string())),
vec![Expr::Int(1), Expr::Int(2)],
);
assert_eq!(format!("{}", expr), "(+ 1 2)");
}
}
}
parser.rs — test parsing of every construct:
#![allow(unused)]
fn main() {
#[test]
fn parse_factorial() {
let input = "(define (fact n) (if (= n 0) 1 (* n (fact (- n 1)))))";
let (remaining, expr) = parse_expr(input).unwrap();
assert_eq!(remaining, "");
match expr {
Expr::DefineFunc(name, params, body) => {
assert_eq!(name, "fact");
assert_eq!(params, vec!["n".to_string()]);
assert_eq!(body.len(), 1); // single if expression
}
_ => panic!("expected DefineFunc"),
}
}
}
analysis.rs — test valid and invalid programs:
#![allow(unused)]
fn main() {
#[test]
fn recursive_function_is_valid() {
let exprs = vec![Expr::DefineFunc(
"f".to_string(),
vec!["x".to_string()],
vec![Expr::Call(
Box::new(Expr::Symbol("f".to_string())),
vec![Expr::Symbol("x".to_string())],
)],
)];
assert!(analyse(&exprs).is_ok());
}
}
codegen.rs — test C output for known inputs:
#![allow(unused)]
fn main() {
#[test]
fn gen_full_program() {
let exprs = vec![
Expr::Call(
Box::new(Expr::Symbol("display".to_string())),
vec![Expr::Int(42)],
),
];
let output = generate(&exprs);
assert!(output.contains("display_int(42LL)"));
assert!(output.contains("int main()"));
}
}
Integration tests
Integration tests live in the tests/ directory. Each test compiles a MiniLisp program, runs the C compiler, executes the binary, and asserts on the output:
#![allow(unused)]
fn main() {
// tests/integration.rs
use std::process::Command;
use std::io::Write;
fn compile_and_run(minilisp: &str) -> String {
// Run our compiler
let mut child = Command::new("cargo")
.args(["run", "--quiet"])
.stdin(std::process::Stdio::piped())
.stdout(std::process::Stdio::piped())
.stderr(std::process::Stdio::piped())
.spawn()
.expect("failed to start compiler");
child
.stdin
.as_mut()
.unwrap()
.write_all(minilisp.as_bytes())
.unwrap();
let output = child.wait_with_output().unwrap();
assert!(
output.status.success(),
"compiler failed: {}",
String::from_utf8_lossy(&output.stderr)
);
let c_code = String::from_utf8(output.stdout).unwrap();
// Write C to a temp file (requires the tempfile crate as a
// dev-dependency, e.g. tempfile = "3" in Cargo.toml)
let dir = tempfile::tempdir().unwrap();
let c_path = dir.path().join("test.c");
let bin_path = dir.path().join("test");
std::fs::write(&c_path, &c_code).unwrap();
// Compile C
let cc = Command::new("cc")
.args([
c_path.to_str().unwrap(),
"-o",
bin_path.to_str().unwrap(),
])
.output()
.expect("failed to run cc");
assert!(
cc.status.success(),
"cc failed: {}",
String::from_utf8_lossy(&cc.stderr)
);
// Run the binary
let run = Command::new(bin_path)
.output()
.expect("failed to run binary");
String::from_utf8(run.stdout).unwrap()
}
#[test]
fn test_display_integer() {
let output = compile_and_run("(display 42) (newline)");
assert_eq!(output, "42\n");
}
#[test]
fn test_arithmetic() {
let output = compile_and_run("(display (+ 3 (* 4 5))) (newline)");
assert_eq!(output, "23\n");
}
#[test]
fn test_if_true() {
let output = compile_and_run("(display (if #t 1 0)) (newline)");
assert_eq!(output, "1\n");
}
#[test]
fn test_if_false() {
let output = compile_and_run("(display (if #f 1 0)) (newline)");
assert_eq!(output, "0\n");
}
#[test]
fn test_define_and_use() {
let output = compile_and_run("(define x 10) (display x) (newline)");
assert_eq!(output, "10\n");
}
#[test]
fn test_function_definition() {
let output = compile_and_run(
"(define (double x) (* x 2)) (display (double 21)) (newline)"
);
assert_eq!(output, "42\n");
}
#[test]
fn test_recursion() {
let output = compile_and_run(
"(define (fact n) (if (= n 0) 1 (* n (fact (- n 1))))) \
(display (fact 10)) (newline)"
);
assert_eq!(output, "3628800\n");
}
#[test]
fn test_let_binding() {
let output = compile_and_run(
"(display (let ((x 10) (y 20)) (+ x y))) (newline)"
);
assert_eq!(output, "30\n");
}
#[test]
fn test_begin() {
let output = compile_and_run(
"(begin (display 1) (display 2) (display 3)) (newline)"
);
assert_eq!(output, "123\n");
}
#[test]
fn test_string_display() {
let output = compile_and_run("(display \"hello world\") (newline)");
assert_eq!(output, "hello world\n");
}
#[test]
fn test_boolean_display() {
let output = compile_and_run("(display #t) (newline)");
assert_eq!(output, "#t\n");
}
#[test]
fn test_comparison() {
let output = compile_and_run(
"(display (if (< 3 5) 1 0)) (newline)"
);
assert_eq!(output, "1\n");
}
}
Test corpus checklist
Ensure you have tests covering:
- Integer literals (positive, negative, zero)
- Boolean literals
- String literals (with escape sequences)
- Arithmetic operators (+, -, *, /)
- Comparison operators (=, <, >, <=, >=)
- Boolean operators (not, and, or)
- display for each type
- newline
- define (variable)
- define (function, including recursive)
- if (both branches)
- lambda
- let bindings
- begin sequencing
- Nested expressions
- Comments in source
- Invalid programs (undefined symbols, malformed forms)
Running the tests
# Unit tests
cargo test
# Integration tests (requires cc)
cargo test --test integration
18. What’s Next: Extensions and Further Reading
The compiler you have built is deliberately minimal — a solid foundation. This final section surveys the directions you can take it further: tail-call optimisation, closures and lambda lifting, a garbage collector, hygienic macros, a type system, an interactive REPL, and a self-hosting MiniLisp standard library. It closes with a curated reading list for going deeper into compiler theory and Lisp implementation.
Extensions
Tail-call optimisation. Recursive functions like fact currently grow the C call stack with every recursive call. In Lisp, tail calls are expected to run in constant space. You can implement this by detecting tail-position calls and compiling them as goto jumps or loops in the generated C.
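The loop transformation can be sketched in Rust rather than in the generated C. Assuming fact is first rewritten into an accumulator-passing form (a hypothetical rewrite of the earlier definition), the tail call becomes parameter reassignment plus a jump back to the top:

```rust
// Tail-recursive shape: fact(n, acc) = if n == 0 { acc }
// else fact(n - 1, n * acc). Instead of emitting a call, reassign the
// parameters and loop; the stack no longer grows with n.
fn fact_loop(mut n: i64, mut acc: i64) -> i64 {
    loop {
        if n == 0 {
            return acc;
        }
        let next_acc = n * acc; // compute new arguments first
        n -= 1;
        acc = next_acc;
    }
}

fn main() {
    // Same result as the recursive fact, in constant stack space.
    assert_eq!(fact_loop(10, 1), 3628800);
}
```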
Closures and lambda lifting. Our compiler handles lambdas that do not capture variables from their enclosing scope. True closures — lambdas that reference variables from outer scopes — require either lambda lifting (passing free variables as extra parameters) or a closure data structure that packages the function pointer with its captured environment.
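As a sketch of the lifting approach (names here are hypothetical): a MiniLisp lambda such as (lambda (x) (+ x k)), where k is free, becomes a top-level function that receives k as an extra parameter, and every call site passes the captured value explicitly:

```rust
// (lambda (x) (+ x k)) with free variable k, lifted to a top-level
// function taking k as an extra parameter.
fn lifted_lambda_0(x: i64, k: i64) -> i64 {
    x + k
}

fn main() {
    let k = 10;
    // The call site supplies the captured variable explicitly.
    assert_eq!(lifted_lambda_0(5, k), 15);
}
```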
Garbage collection. MiniLisp allocates strings but never frees them. For long-running programs, you would add a garbage collector. A simple mark-and-sweep collector is a good starting point: maintain a list of all allocated objects, periodically mark the ones reachable from the stack and global variables, and free the rest.
Hygienic macros. Lisp macros transform code before it is evaluated. A macro system would let MiniLisp users define their own special forms. Hygienic macros (as in Scheme’s syntax-rules) ensure that macro-generated code does not accidentally capture user variables.
A type system. Adding type annotations and inference would catch errors like passing a string to + at compile time instead of generating C that segfaults. Hindley-Milner type inference is the classic starting point for functional languages.
An interactive REPL. A Read-Eval-Print Loop would let users type MiniLisp expressions and see results immediately. You could implement this by compiling each expression to C, linking it dynamically, and calling it — or by adding an interpreter mode alongside the compiler.
A standard library. Write a MiniLisp standard library in MiniLisp: list operations (map, filter, fold), higher-order functions, and utility procedures. This exercises the compiler and reveals any limitations in the language.
Self-hosting. The ultimate test of a compiler: rewrite the compiler itself in MiniLisp. This requires adding enough features to MiniLisp (file I/O, string manipulation, data structures) to make it practical, then translating the Rust code into MiniLisp.
Further reading
Compiler construction:
- Compilers: Principles, Techniques, and Tools by Aho, Lam, Sethi, and Ullman (the “Dragon Book”) — the canonical compiler textbook.
- Engineering a Compiler by Cooper and Torczon — a modern, practical alternative.
- Crafting Interpreters by Robert Nystrom — an excellent, free online book that walks through building two complete interpreters.
Lisp implementation:
- Structure and Interpretation of Computer Programs (SICP) by Abelson and Sussman — the classic computer science textbook, centred on Scheme.
- Lisp in Small Pieces by Christian Queinnec — deep coverage of Lisp implementation strategies, from interpreters to compilers.
- An Incremental Approach to Compiler Construction by Abdulaziz Ghuloum — a paper describing how to build a Scheme-to-x86 compiler incrementally, one feature at a time.
Rust and parsing:
- The nom documentation and nom recipes — essential reference for parser combinators in Rust.
- Programming Rust by Blandy, Orendorff, and Tindall — comprehensive coverage of Rust, including traits, generics, and the borrow checker.
Congratulations
You have built a working compiler from scratch. It reads MiniLisp, validates it, and emits C. You understand the core pipeline — parsing, semantic analysis, code generation — that underlies every compiler, from GCC to the Rust compiler itself. The techniques you have learned here transfer directly to building compilers, interpreters, static analysers, linters, and code formatters for any language. The next step is yours: pick an extension, implement it, and keep building.