Building a Simple LLM from Scratch
A hands-on course building a small GPT-1-style language model in Rust — from raw text to a trained, sampling transformer.
Part 1 — Language Modeling Basics
§1 What is a Language Model?
A language model is a system that assigns probabilities to sequences of tokens. Given some context — a sequence of words, characters, or subwords that have appeared so far — a language model answers the question:
“What token is most likely to come next?”
Formally, a language model estimates the conditional probability distribution:
\[ P(t_{n+1} \mid t_1, t_2, \ldots, t_n) \]
where \( t_1, t_2, \ldots, t_n \) is the context (the tokens seen so far) and \( t_{n+1} \) is the next token.
A concrete example
Suppose we are building a character-level language model trained on English text. Given the context:
The cat sat on the m
Our model might produce a probability distribution like:
| Next character | Probability |
|---|---|
| a | 0.55 |
| o | 0.15 |
| e | 0.10 |
| i | 0.05 |
| (other) | 0.15 |
The model thinks a is most likely (leading to “mat”, “map”, “man”, etc.), followed by o (“moon”, “mop”, etc.). It learned these patterns from the statistics of its training data — it has never been told English grammar rules.
Autoregressive generation
Language models generate text autoregressively — one token at a time, feeding each generated token back in as context for the next prediction:
Step 1: "The cat" → predict ' '
Step 2: "The cat " → predict 's'
Step 3: "The cat s" → predict 'a'
Step 4: "The cat sa" → predict 't'
...and so on
This loop of predict-then-append is the core mechanism behind every text-generating AI, from simple bigram models to GPT-4.
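To make the loop's shape concrete, here is a minimal Rust sketch. The "predictions" are hard-coded for illustration — a real model would sample each next token from its probability distribution given the text so far:

fn main() {
    let mut text = String::from("The cat");
    // Stand-in predictions; a trained model produces these, one per step,
    // from its probability distribution over the vocabulary.
    for predicted in [' ', 's', 'a', 't'] {
        text.push(predicted); // append, then predict again with the longer context
        println!("{}", text);
    }
}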
Why is language modeling useful?
Language modeling sounds like a narrow statistical task, but it turns out to be remarkably powerful:
- Text generation. Chatbots, story writers, and code assistants all generate text by sampling from a language model.
- Representation learning. Training a model to predict the next token forces it to learn deep representations of syntax, semantics, and even factual knowledge.
- Foundation for downstream tasks. Models pre-trained on language modeling can be fine-tuned for translation, summarisation, question answering, and more.
- Compression perspective. A good language model is a good compressor — it can represent text efficiently by encoding only the “surprising” tokens. This connection to information theory is why language modeling is such a fundamental problem.
The spectrum of language models
Language models range from the trivially simple to the extraordinarily complex:
Simple                                                               Complex
  |                                                                     |
Bigram ───── N-gram ───── RNN ─────── LSTM ──────── Transformer ───── GPT-4
counts       tables       neural      gated         attention         massive
                          net         RNN           mechanism         scale
In this course, we will build a small Transformer-based language model — the same architecture family that powers GPT, Claude, and other modern LLMs. Ours will be tiny (a few thousand parameters), but it contains every essential component of its larger cousins.
Key takeaway: A language model predicts the next token given context. Despite sounding simple, this task — when scaled up — gives rise to the capabilities we see in modern AI systems.
§2 Character-Level Tokenisation
Before a language model can process text, we need to convert raw text into numbers. This process is called tokenisation — splitting text into discrete units (tokens) and mapping each to a numerical ID.
Three approaches to tokenisation
There are three main strategies, each with different tradeoffs:
| Strategy | Token unit | Vocabulary size | Example: “cats” |
|---|---|---|---|
| Character-level | Single characters | ~100 | ['c', 'a', 't', 's'] |
| Subword (BPE, etc.) | Character groups | ~30,000-100,000 | ['cat', 's'] |
| Word-level | Whole words | ~100,000+ | ['cats'] |
Word-level tokenisation is simple but struggles with misspellings, rare words, and morphological variation (“run”, “running”, “ran” are all separate tokens).
Subword methods like Byte-Pair Encoding (BPE) — used by GPT models — strike a balance: common words get their own token, while rare words are split into pieces. The word “unhappiness” might become ["un", "happiness"].
Character-level tokenisation is the simplest: every character in the text is its own token. The vocabulary is tiny — just the set of unique characters in the training data.
Why character-level for this course?
We use character-level tokenisation because:
- Simplicity. No external tokeniser library is needed. We can build it from scratch in a few lines of Rust.
- Small vocabulary. A typical English text corpus has fewer than 100 unique characters, which means smaller embedding tables and faster training.
- No unknown tokens. Any character in the input can be represented — there is no “out of vocabulary” problem.
- Educational clarity. It is easy to inspect what the model is learning when each token is a single visible character.
The downsides are that sequences become long (the word “language” is 8 tokens instead of 1) and the model must learn to spell words from individual characters. For our small educational model, these tradeoffs are acceptable.
Building a vocabulary
Given a training corpus, we construct a vocabulary by:
- Collecting all unique characters in the text.
- Sorting them (for deterministic ordering).
- Assigning each character a unique integer ID.
For example, given the text "hello world":
Unique characters (sorted): [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']
Character → ID mapping:
' ' → 0
'd' → 1
'e' → 2
'h' → 3
'l' → 4
'o' → 5
'r' → 6
'w' → 7
Encoding and decoding
Encoding converts a string into a sequence of token IDs:
"hello" → [3, 2, 4, 4, 5]
Decoding converts a sequence of token IDs back into a string:
[7, 5, 6, 4, 1] → "world"
These two operations must be perfect inverses — decode(encode(text)) == text — or we lose information.
A note on special tokens
Production tokenisers often include special tokens like <PAD> (padding), <BOS> (beginning of sequence), and <EOS> (end of sequence). For our simple character-level tokeniser, we will not use these — every token corresponds to a real character in the text.
Key takeaway: Character-level tokenisation maps each character to an integer. It is the simplest tokenisation scheme and ideal for learning, though real LLMs use subword methods for efficiency.
§3 Exercise 1: Build a Character-Level Tokeniser in Rust
In this exercise, we build a CharTokeniser struct that can encode text into token IDs and decode them back.
Project setup
Create a new Rust project:
cargo new llm-from-scratch
cd llm-from-scratch
Your Cargo.toml needs only the standard library for now. We will add candle in later exercises:
[package]
name = "llm-from-scratch"
version = "0.1.0"
edition = "2021"
[dependencies]
# We will add candle-core and candle-nn later
[profile.release]
opt-level = "z"
lto = true
strip = true
codegen-units = 1
The CharTokeniser struct
Create src/tokeniser.rs:
use std::collections::HashMap;
/// A character-level tokeniser that maps individual characters
/// to integer IDs and back.
pub struct CharTokeniser {
/// Maps each character to its token ID.
char_to_id: HashMap<char, u32>,
/// Maps each token ID back to its character.
id_to_char: Vec<char>,
}
impl CharTokeniser {
/// Build a tokeniser from a training corpus.
///
/// Collects all unique characters, sorts them, and assigns
/// sequential IDs starting from 0.
///
/// # Example
/// ```
/// let tok = CharTokeniser::from_corpus("hello world");
/// assert_eq!(tok.vocab_size(), 8); // ' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w'
/// ```
pub fn from_corpus(text: &str) -> Self {
let mut chars: Vec<char> = text.chars().collect::<std::collections::HashSet<_>>()
.into_iter()
.collect();
chars.sort();
let char_to_id: HashMap<char, u32> = chars
.iter()
.enumerate()
.map(|(i, &c)| (c, i as u32))
.collect();
CharTokeniser {
char_to_id,
id_to_char: chars,
}
}
/// Returns the number of unique tokens in the vocabulary.
pub fn vocab_size(&self) -> usize {
self.id_to_char.len()
}
/// Encode a string into a sequence of token IDs.
///
/// # Panics
/// Panics if the string contains a character not in the vocabulary.
pub fn encode(&self, text: &str) -> Vec<u32> {
text.chars()
.map(|c| {
*self.char_to_id
.get(&c)
.unwrap_or_else(|| panic!("Character '{}' not in vocabulary", c))
})
.collect()
}
/// Decode a sequence of token IDs back into a string.
///
/// # Panics
/// Panics if any token ID is out of range.
pub fn decode(&self, ids: &[u32]) -> String {
ids.iter()
.map(|&id| {
*self.id_to_char
.get(id as usize)
.unwrap_or_else(|| panic!("Token ID {} out of range", id))
})
.collect()
}
/// Print the vocabulary mapping for inspection.
pub fn print_vocab(&self) {
println!("Vocabulary ({} tokens):", self.vocab_size());
for (i, c) in self.id_to_char.iter().enumerate() {
let display = match c {
'\n' => "\\n".to_string(),
'\t' => "\\t".to_string(),
' ' => "' '".to_string(),
_ => format!("'{}'", c),
};
println!(" {} → {}", display, i);
}
}
}
Wiring it up in main.rs
In src/main.rs:
mod tokeniser;
use tokeniser::CharTokeniser;
fn main() {
let corpus = "\
To be, or not to be, that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles.";
let tok = CharTokeniser::from_corpus(corpus);
tok.print_vocab();
let sample = "to be";
let encoded = tok.encode(sample);
println!("\nEncoded \"{}\": {:?}", sample, encoded);
let decoded = tok.decode(&encoded);
println!("Decoded back: \"{}\"", decoded);
// Verify round-trip
assert_eq!(decoded, sample);
println!("\nRound-trip check passed.");
}
Expected output
When you run cargo run, you should see something like:
Vocabulary (39 tokens):
'\n' → 0
' ' → 1
''' → 2
',' → 3
'.' → 4
':' → 5
'O' → 6
'T' → 7
'W' → 8
'a' → 9
...
Encoded "to be": [30, 21, 1, 12, 15]
Decoded back: "to be"
Round-trip check passed.
(Your exact IDs will depend on the characters present in the corpus.)
Exercises to try
- Extend the corpus. Download a larger text (e.g., from Project Gutenberg) and see how the vocabulary grows.
- Handle unknown characters. Modify `encode` to return an `Option` or use a special `<UNK>` token instead of panicking.
- Measure sequence length. Encode a paragraph and compare the number of tokens to the number of words. How much longer is the character-level encoding?
Key takeaway: A character-level tokeniser is just two lookup tables — a hash map from characters to IDs and a vector from IDs back to characters. The `from_corpus` method automatically builds the vocabulary from whatever text you give it.
Part 2 — The Transformer Architecture
§4 Embeddings and Positional Encoding
Our tokeniser converts text into a sequence of integer IDs. But neural networks work with continuous vectors, not discrete integers. Embeddings bridge this gap by mapping each token ID to a dense vector of floating-point numbers.
What is an embedding?
An embedding is a lookup table — a matrix of shape (vocab_size, embed_dim) where each row is a learnable vector representing one token.
Embedding table (vocab_size=5, embed_dim=4):
Token ID 0 → [ 0.12, -0.34, 0.56, 0.78]
Token ID 1 → [-0.91, 0.23, 0.45, -0.67]
Token ID 2 → [ 0.33, 0.11, -0.88, 0.54]
Token ID 3 → [ 0.76, -0.55, 0.22, 0.13]
Token ID 4 → [-0.42, 0.89, -0.11, 0.66]
Given the input token IDs [2, 0, 3], the embedding layer simply looks up rows 2, 0, and 3:
Input: [2, 0, 3]
Output: [[ 0.33, 0.11, -0.88, 0.54], ← row 2
[ 0.12, -0.34, 0.56, 0.78], ← row 0
[ 0.76, -0.55, 0.22, 0.13]] ← row 3
The result is a matrix of shape (sequence_length, embed_dim). Initially, these vectors are random. During training, backpropagation adjusts them so that tokens with similar meanings end up with similar vectors.
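To stress that an embedding layer is nothing more than row indexing, here is a minimal sketch in plain Rust (no tensor library), using the table values from the example above — candle's `Embedding` performs exactly this lookup, with the table registered as a trainable parameter:

fn embed(table: &[Vec<f32>], token_ids: &[usize]) -> Vec<Vec<f32>> {
    // "Embedding" a token is just copying out row `token_id` of the table.
    token_ids.iter().map(|&id| table[id].clone()).collect()
}

fn main() {
    let table = vec![
        vec![0.12, -0.34, 0.56, 0.78],  // token 0
        vec![-0.91, 0.23, 0.45, -0.67], // token 1
        vec![0.33, 0.11, -0.88, 0.54],  // token 2
        vec![0.76, -0.55, 0.22, 0.13],  // token 3
        vec![-0.42, 0.89, -0.11, 0.66], // token 4
    ];
    let embedded = embed(&table, &[2, 0, 3]);
    println!("{:?}", embedded); // rows 2, 0, 3 — shape (3, 4)
}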
Why do we need embeddings?
Integer IDs have no inherent structure — the fact that “a” is token 0 and “b” is token 1 does not mean they are “close” in any useful sense. Embeddings give the model a continuous space where it can represent relationships. After training, you might find that:
- Vowel characters cluster together.
- Uppercase and lowercase versions of the same letter are nearby.
- Punctuation characters form their own cluster.
The embedding dimension
The embedding dimension (embed_dim or d_model) is a hyperparameter you choose. It controls how much information each token vector can carry:
| Embed dim | Capacity | Training cost | Typical use |
|---|---|---|---|
| 16-64 | Low | Very fast | Toy models (like ours) |
| 128-512 | Medium | Moderate | Small-scale experiments |
| 768-4096 | High | Expensive | Production LLMs |
For our tiny model, we will use embed_dim = 64.
The position problem
Consider two sentences:
"The cat ate the fish"
"The fish ate the cat"
These contain the exact same tokens but have very different meanings. If we only embed the tokens, the model has no way to tell the two sentences apart — it does not know the order of the tokens.
Positional encoding
To inject position information, we add a positional encoding vector to each token embedding. The original Transformer paper (“Attention Is All You Need”) proposed using fixed sinusoidal functions:
\[ PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right) \] \[ PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right) \]
Where:
- `pos` is the position in the sequence (0, 1, 2, …).
- `i` is the dimension index.
- `d_model` is the embedding dimension.
These functions produce a unique pattern for each position. Low-frequency sinusoids change slowly across positions (capturing coarse position), while high-frequency sinusoids change rapidly (capturing fine position).
Position 0: [sin(0), cos(0), sin(0), cos(0), ...] = [0.00, 1.00, 0.00, 1.00, ...]
Position 1: [sin(1), cos(1), sin(ε), cos(ε), ...] = [0.84, 0.54, 0.01, 1.00, ...]
Position 2: [sin(2), cos(2), sin(2ε), cos(2ε),...] = [0.91, -0.42, 0.02, 1.00, ...]
(Here ε = 1/10000^(2/d_model), a very small number for higher dimensions.)
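A short sketch that computes these sinusoids, so you can verify the pattern above yourself (illustrative only — our model will use learned position embeddings instead):

fn positional_encoding(pos: usize, d_model: usize) -> Vec<f32> {
    (0..d_model)
        .map(|dim| {
            let i = dim / 2; // dimensions come in (sin, cos) pairs
            let freq = 1.0 / 10000f64.powf(2.0 * i as f64 / d_model as f64);
            let angle = pos as f64 * freq;
            if dim % 2 == 0 { angle.sin() as f32 } else { angle.cos() as f32 }
        })
        .collect()
}

fn main() {
    for pos in 0..3 {
        // Print the first four dimensions for each position
        println!("Position {}: {:.2?}", pos, &positional_encoding(pos, 8)[..4]);
    }
}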
Learned positional embeddings
GPT-1 and GPT-2 use a simpler approach: a second embedding table of shape (max_seq_len, embed_dim) where each position gets its own learnable vector, just like tokens do. This is what we will implement:
Final input = token_embedding[token_id] + position_embedding[position]
Both the token embeddings and position embeddings are learned during training.
Putting it all together
The input pipeline for our model looks like this:
"hello" → [3, 2, 4, 4, 5] (tokenisation)
→ [[0.12, ...], [-0.91, ...], (token embedding lookup)
[0.33, ...], [0.33, ...],
[-0.42, ...]]
+ [[0.05, ...], [0.11, ...], (position embedding lookup)
[0.22, ...], [0.08, ...],
[0.17, ...]]
= [[0.17, ...], [-0.80, ...], (element-wise addition)
[0.55, ...], [0.41, ...],
[-0.25, ...]]
The result is a (seq_len, embed_dim) matrix that carries both what each token is and where it appears. This matrix is what we feed into the Transformer blocks.
Key takeaway: Embeddings convert token IDs into learnable vectors. Positional encodings add position information so the model can distinguish “the cat ate the fish” from “the fish ate the cat”. Together, they form the input to the Transformer.
§5 Self-Attention: Queries, Keys, and Values
Self-attention is the mechanism that makes Transformers work. It allows every token in a sequence to look at every other token and decide which ones are relevant. This is the single most important concept in this course.
The intuition: a database lookup
Think of self-attention as a soft database lookup:
- Each token formulates a query: “What kind of information am I looking for?”
- Each token advertises a key: “Here is what kind of information I have.”
- Each token holds a value: “Here is my actual information.”
To process a token, we compare its query against all keys. Where there is a strong match, we pull in the corresponding value. The result is a weighted combination of all values, where the weights reflect how relevant each token is.
Token: "sat"
Query: "I'm a verb — who is my subject?"
Keys available:
"The" → "I'm a determiner" (low match)
"cat" → "I'm a noun/subject" (HIGH match)
"sat" → "I'm a verb" (medium match)
"on" → "I'm a preposition" (low match)
Result: mostly attend to "cat", somewhat to "sat", barely to others
Of course, the model does not literally think in words — these “queries” and “keys” are learned vector representations. But the analogy captures the mechanism.
The math: scaled dot-product attention
Given a sequence of n token embeddings (each of dimension d_model), self-attention works as follows:
Step 1: Project into Q, K, V.
We use three learned weight matrices \( W_Q, W_K, W_V \) (each of shape d_model × d_k) to produce:
\[ Q = X W_Q, \quad K = X W_K, \quad V = X W_V \]
Where \( X \) is the input matrix of shape (n, d_model) and \( Q, K, V \) are each of shape (n, d_k).
Step 2: Compute attention scores.
We compute the dot product of each query with all keys:
\[ \text{scores} = Q K^T \]
This produces an (n, n) matrix where entry (i, j) measures how much token i should attend to token j.
Step 3: Scale.
We divide by \( \sqrt{d_k} \) to prevent the dot products from becoming too large (which would push softmax into regions with tiny gradients):
\[ \text{scaled_scores} = \frac{Q K^T}{\sqrt{d_k}} \]
Step 4: Softmax.
We apply softmax row-wise so that each row sums to 1, giving us a proper probability distribution:
\[ \text{attention_weights} = \text{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) \]
Step 5: Weighted sum of values.
We multiply the attention weights by the values:
\[ \text{output} = \text{attention_weights} \times V \]
The complete formula is:
\[ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V \]
A small numeric example
Let us walk through attention with a tiny example. Suppose we have 3 tokens with d_k = 2:
Q (queries): K (keys): V (values):
[1.0, 0.0] [1.0, 0.0] [1.0, 0.0]
[0.0, 1.0] [0.0, 1.0] [0.0, 1.0]
[1.0, 1.0] [1.0, 1.0] [0.5, 0.5]
Step 2 — QK^T:
K^T = [1.0 0.0 1.0]
[0.0 1.0 1.0]
QK^T = [1.0*1.0+0.0*0.0 1.0*0.0+0.0*1.0 1.0*1.0+0.0*1.0] [1.0 0.0 1.0]
[0.0*1.0+1.0*0.0 0.0*0.0+1.0*1.0 0.0*1.0+1.0*1.0] = [0.0 1.0 1.0]
[1.0*1.0+1.0*0.0 1.0*0.0+1.0*1.0 1.0*1.0+1.0*1.0] [1.0 1.0 2.0]
Step 3 — Scale by √d_k = √2 ≈ 1.414:
Scaled = [0.71 0.00 0.71]
[0.00 0.71 0.71]
[0.71 0.71 1.41]
Step 4 — Softmax (row-wise):
Row 0: softmax([0.71, 0.00, 0.71]) ≈ [0.40, 0.20, 0.40]
Row 1: softmax([0.00, 0.71, 0.71]) ≈ [0.20, 0.40, 0.40]
Row 2: softmax([0.71, 0.71, 1.41]) ≈ [0.25, 0.25, 0.50]
Step 5 — Multiply by V:
Output row 0 = 0.40*[1,0] + 0.20*[0,1] + 0.40*[0.5,0.5] = [0.60, 0.40]
Output row 1 = 0.20*[1,0] + 0.40*[0,1] + 0.40*[0.5,0.5] = [0.40, 0.60]
Output row 2 = 0.25*[1,0] + 0.25*[0,1] + 0.50*[0.5,0.5] = [0.50, 0.50]
Token 2 (whose query was [1,1] — “I want everything”) ends up with a balanced mixture [0.50, 0.50]. Token 0 (query [1,0]) ends up leaning toward the first dimension [0.60, 0.40]. The attention mechanism has routed information according to what each token asked for.
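The whole computation fits in a few lines of plain Rust. Here is a from-scratch sketch (no tensor library) that reproduces the three-token example above — Exercise 2 redoes this with candle:

fn attention(q: &[Vec<f32>], k: &[Vec<f32>], v: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let d_k = q[0].len() as f32;
    q.iter()
        .map(|qi| {
            // Dot product of this query with every key, scaled by sqrt(d_k)
            let scores: Vec<f32> = k.iter()
                .map(|kj| qi.iter().zip(kj).map(|(a, b)| a * b).sum::<f32>() / d_k.sqrt())
                .collect();
            // Row-wise softmax
            let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
            let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
            let sum: f32 = exps.iter().sum();
            let weights: Vec<f32> = exps.iter().map(|e| e / sum).collect();
            // Weighted sum of the value rows
            (0..v[0].len())
                .map(|d| weights.iter().zip(v).map(|(w, vj)| w * vj[d]).sum())
                .collect()
        })
        .collect()
}

fn main() {
    let q = vec![vec![1.0, 0.0], vec![0.0, 1.0], vec![1.0, 1.0]];
    let k = q.clone();
    let v = vec![vec![1.0, 0.0], vec![0.0, 1.0], vec![0.5, 0.5]];
    for row in attention(&q, &k, &v) {
        println!("{:.2?}", row); // ≈ [0.60, 0.40], [0.40, 0.60], [0.50, 0.50]
    }
}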
Why self-attention works
Self-attention has two properties that make it powerful:
- Global context. Every token can attend to every other token in a single step. In an RNN, information must flow through many sequential steps to get from one end of the sequence to the other.
- Content-based routing. Which tokens to attend to is determined by the content (via Q and K), not by fixed connectivity patterns. The model learns to route information dynamically.
RNN: information flows sequentially Attention: direct connections
t1 → t2 → t3 → t4 → t5 t1 ←→ t2 ←→ t3 ←→ t4 ←→ t5
(5 steps from t1 to t5) (1 step between any pair)
Key takeaway: Self-attention computes a weighted sum of value vectors, where the weights are determined by the similarity between query and key vectors. The formula \(\text{softmax}(QK^T / \sqrt{d_k}) V\) is the mathematical heart of the Transformer.
§6 The Transformer Block
A single self-attention layer is powerful but not sufficient. The Transformer block wraps attention in a series of components that make training stable and learning more expressive.
Components of a Transformer block
A Transformer block contains four components, applied in order:
- Multi-head self-attention
- Residual connection + Layer normalisation
- Feed-forward network (FFN)
- Residual connection + Layer normalisation
Here is the full block in ASCII art:
┌──────────────────────┐
│ Input (x) │
└──────────┬───────────┘
│
├──────────────────────┐
▼ │ (residual)
┌──────────────────────┐ │
│ Multi-Head │ │
│ Self-Attention │ │
└──────────┬───────────┘ │
│ │
▼ │
(+) ◄────────────────────┘
│
▼
┌──────────────────────┐
│ Layer Norm │
└──────────┬───────────┘
│
├──────────────────────┐
▼ │ (residual)
┌──────────────────────┐ │
│ Feed-Forward │ │
│ Network (FFN) │ │
└──────────┬───────────┘ │
│ │
▼ │
(+) ◄────────────────────┘
│
▼
┌──────────────────────┐
│ Layer Norm │
└──────────┬───────────┘
│
▼
┌──────────────────────┐
│ Output │
└──────────────────────┘
Let us look at each component.
Multi-head self-attention
Instead of running a single attention computation, multi-head attention runs multiple attention heads in parallel, each with its own \( W_Q, W_K, W_V \) matrices. Each head can learn to focus on different types of relationships:
- Head 1 might learn syntactic relationships (subject-verb agreement).
- Head 2 might learn positional proximity (nearby characters).
- Head 3 might learn semantic similarity.
If d_model = 64 and we use 4 heads, each head operates on d_k = d_model / num_heads = 16 dimensions. The outputs of all heads are concatenated and projected back to d_model:
\[ \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W_O \]
where each \( \text{head}_i = \text{Attention}(X W_Q^i, X W_K^i, X W_V^i) \).
Residual connections
A residual connection (or skip connection) adds the input of a sublayer directly to its output:
\[ \text{output} = \text{sublayer}(x) + x \]
This seemingly simple trick is crucial for training deep networks. It ensures that gradients can flow directly through the network during backpropagation, preventing the vanishing gradient problem that plagues deep architectures. Even if a sublayer learns nothing useful, the residual connection ensures the signal passes through unchanged.
Layer normalisation
Layer normalisation normalises the values across the feature dimension for each token independently:
\[ \text{LayerNorm}(x) = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \cdot \gamma + \beta \]
Where \( \mu \) and \( \sigma^2 \) are the mean and variance computed across the feature dimension, and \( \gamma, \beta \) are learnable scale and shift parameters.
Layer norm keeps activations in a reasonable range, which stabilises training. Without it, values can explode or collapse as they pass through many layers.
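Layer norm is simple enough to sketch in a few lines for a single token's feature vector (in the real model, γ and β are per-feature and learnable; here they are scalars for brevity):

fn layer_norm(x: &[f32], gamma: f32, beta: f32, eps: f32) -> Vec<f32> {
    let n = x.len() as f32;
    let mean = x.iter().sum::<f32>() / n;
    let var = x.iter().map(|v| (v - mean).powi(2)).sum::<f32>() / n;
    // Normalise to zero mean / unit variance, then scale and shift
    x.iter().map(|v| (v - mean) / (var + eps).sqrt() * gamma + beta).collect()
}

fn main() {
    let normed = layer_norm(&[2.0, 4.0, 6.0, 8.0], 1.0, 0.0, 1e-5);
    println!("{:.3?}", normed); // ≈ [-1.342, -0.447, 0.447, 1.342]
}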
Feed-forward network
The FFN is a simple two-layer neural network applied independently to each token position:
\[ \text{FFN}(x) = \text{GELU}(x W_1 + b_1) W_2 + b_2 \]
The inner dimension is typically 4x the model dimension (e.g., d_model = 64 → d_ff = 256). The GELU activation function is a smooth approximation of ReLU used in GPT models.
The FFN is where much of the model’s “knowledge” is stored. While attention determines what information to combine, the FFN transforms that information.
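As a sketch, here is the FFN applied to one token vector with tiny fixed example weights (biases omitted for brevity; real weights are learned, with d_ff = 4 × d_model). The `gelu` below is the tanh approximation commonly used in GPT-style models:

fn gelu(x: f32) -> f32 {
    let c = (2.0 / std::f32::consts::PI).sqrt();
    0.5 * x * (1.0 + (c * (x + 0.044715 * x * x * x)).tanh())
}

fn main() {
    // d_model = 2, d_ff = 4: project up, apply GELU, project back down
    let x = [0.5f32, -1.0];
    let w1 = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]; // (2, 4)
    let w2 = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]; // (4, 2)
    let h: Vec<f32> = (0..4).map(|j| gelu(x[0] * w1[0][j] + x[1] * w1[1][j])).collect();
    let out: Vec<f32> = (0..2).map(|j| (0..4).map(|i| h[i] * w2[i][j]).sum()).collect();
    println!("FFN output: {:.3?}", out);
}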
Pre-norm vs post-norm
The original Transformer used post-norm (normalise after the residual add). GPT-2 and most modern models use pre-norm (normalise before the sublayer). Pre-norm is more stable to train:
Post-norm (original): LayerNorm(x + Sublayer(x))
Pre-norm (GPT-2): x + Sublayer(LayerNorm(x))
We will use pre-norm in our implementation.
Key takeaway: A Transformer block combines multi-head attention (for context mixing), a feed-forward network (for per-token transformation), residual connections (for gradient flow), and layer normalisation (for training stability). These four ingredients, repeated many times, give Transformers their power.
§7 Exercise 2: Implement Self-Attention in Rust
In this exercise, we implement single-head scaled dot-product self-attention using the candle tensor library. This is the core computation from §5 translated into Rust.
Adding candle to your project
Update your Cargo.toml:
[package]
name = "llm-from-scratch"
version = "0.1.0"
edition = "2021"
[dependencies]
candle-core = "0.8"
candle-nn = "0.8"
anyhow = "1"
[profile.release]
opt-level = "z"
lto = true
strip = true
codegen-units = 1
Single-head self-attention
Create src/attention.rs:
use candle_core::{Result, Tensor, D};
/// Compute single-head scaled dot-product self-attention.
///
/// # Arguments
/// * `q` - Query tensor of shape `(seq_len, d_k)`
/// * `k` - Key tensor of shape `(seq_len, d_k)`
/// * `v` - Value tensor of shape `(seq_len, d_k)`
///
/// # Returns
/// Output tensor of shape `(seq_len, d_k)` — the attention-weighted
/// combination of values.
pub fn scaled_dot_product_attention(
q: &Tensor,
k: &Tensor,
v: &Tensor,
) -> Result<Tensor> {
let d_k = q.dim(D::Minus1)? as f64;
// Step 1: QK^T — compute attention scores
// q: (seq_len, d_k), k^T: (d_k, seq_len) → scores: (seq_len, seq_len)
let scores = q.matmul(&k.t()?)?;
// Step 2: Scale by sqrt(d_k)
let scaled = (scores / d_k.sqrt())?;
// Step 3: Softmax along the last dimension (row-wise)
let weights = candle_nn::ops::softmax(&scaled, D::Minus1)?;
// Step 4: Weighted sum of values
// weights: (seq_len, seq_len) × v: (seq_len, d_k) → (seq_len, d_k)
let output = weights.matmul(v)?;
Ok(output)
}
/// Project input through a linear layer (matrix multiply) to produce Q, K, or V.
///
/// # Arguments
/// * `x` - Input tensor of shape `(seq_len, d_model)`
/// * `w` - Weight matrix of shape `(d_model, d_k)`
///
/// # Returns
/// Projected tensor of shape `(seq_len, d_k)`.
pub fn project(x: &Tensor, w: &Tensor) -> Result<Tensor> {
x.matmul(w)
}
Testing it in main.rs
Add the module and a test function to src/main.rs:
mod attention;
mod tokeniser;
use anyhow::Result;
use candle_core::{Device, Tensor};
fn demo_attention() -> Result<()> {
let device = &Device::Cpu;
// Simulate 4 token embeddings, each of dimension 8
let seq_len = 4;
let d_model = 8;
let d_k = 8; // Same as d_model for single-head
// Random input "embeddings"
let x = Tensor::randn(0f32, 1.0, (seq_len, d_model), device)?;
// Random projection weights (in a real model, these are learned)
let w_q = Tensor::randn(0f32, 1.0, (d_model, d_k), device)?;
let w_k = Tensor::randn(0f32, 1.0, (d_model, d_k), device)?;
let w_v = Tensor::randn(0f32, 1.0, (d_model, d_k), device)?;
// Project input into Q, K, V
let q = attention::project(&x, &w_q)?;
let k = attention::project(&x, &w_k)?;
let v = attention::project(&x, &w_v)?;
println!("Input shape: {:?}", x.shape());
println!("Q shape: {:?}", q.shape());
println!("K shape: {:?}", k.shape());
println!("V shape: {:?}", v.shape());
// Compute attention
let output = attention::scaled_dot_product_attention(&q, &k, &v)?;
println!("Output shape: {:?}", output.shape());
println!("\nAttention output:\n{}", output);
Ok(())
}
fn main() {
if let Err(e) = demo_attention() {
eprintln!("Error: {}", e);
}
}
Expected output
Input shape: [4, 8]
Q shape: [4, 8]
K shape: [4, 8]
V shape: [4, 8]
Output shape: [4, 8]
Attention output:
[[ 0.1234, -0.5678, ...],
[ 0.2345, -0.4567, ...],
[ 0.3456, -0.3456, ...],
[ 0.4567, -0.2345, ...]]
(Your exact numbers will differ because of random initialisation.)
What to observe
After running the code, notice:
- Shape preservation. The output has the same shape as the input — `(seq_len, d_k)`. Each token position gets a new vector that is a weighted combination of all value vectors.
- Row similarity. The output rows tend to be more similar to each other than the input rows. This is because attention mixes information across all positions.
- Softmax effect. If you print the attention weights (the output of softmax), you will see that each row sums to 1.0 and typically has one or two dominant values.
Exercises to try
- Print the attention weights matrix. After the softmax step, print the `(seq_len, seq_len)` weight matrix. Which tokens attend most strongly to which?
- Add a causal mask. Before softmax, set the upper-triangle entries of the score matrix to negative infinity. This prevents each position from attending to future positions. (Hint: candle's `Tensor::tril2` builds a lower-triangular matrix of ones you can turn into a mask.)
- Compare with and without scaling. Remove the `/ d_k.sqrt()` and observe how the attention weights change — they should become much more "peaky" (concentrated on one token).
Key takeaway: Self-attention in code is just three matrix multiplications (to project Q, K, V), one more multiply (QK^T), a scale, a softmax, and a final multiply by V. The `candle` crate provides all the tensor operations we need.
Part 3 — Assembling the Model
§8 A Decoder-Only LM: Stacking Blocks and the Causal Mask
We now have all the pieces — embeddings, positional encoding, and Transformer blocks. It is time to assemble them into a complete language model. We will build a decoder-only Transformer, the architecture used by GPT-1, GPT-2, GPT-3, and many other LLMs.
Why “decoder-only”?
The original Transformer (2017) had two halves:
- An encoder that processes an input sequence (e.g., a French sentence).
- A decoder that generates an output sequence (e.g., the English translation), attending to both itself and the encoder output.
GPT (2018) showed that you only need the decoder half. By training the decoder to predict the next token in a single sequence, you get a general-purpose language model. No encoder, no cross-attention — just self-attention with a causal mask.
Original Transformer: Decoder-only (GPT):
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Encoder │→│ Decoder │ │ Decoder │
│ (bidir.) │ │ (causal) │ │ (causal) │
└──────────┘ └──────────┘ └──────────┘
Used for: translation Used for: generation
The causal mask
In a decoder-only model, each token can only attend to tokens at or before its position — never to future tokens. This is essential because during generation, future tokens do not exist yet.
We enforce this with a causal mask (also called a “look-ahead mask”) — a lower-triangular matrix that blocks attention to future positions:
Causal mask for sequence length 5:
t0 t1 t2 t3 t4
t0 [ 1 0 0 0 0 ] ← t0 can only see t0
t1 [ 1 1 0 0 0 ] ← t1 can see t0, t1
t2 [ 1 1 1 0 0 ] ← t2 can see t0, t1, t2
t3 [ 1 1 1 1 0 ] ← t3 can see t0, t1, t2, t3
t4 [ 1 1 1 1 1 ] ← t4 can see everything
In practice, we set the masked positions (zeros above) to \( -\infty \) before the softmax step. Since \( \text{softmax}(-\infty) = 0 \), those positions get zero attention weight.
\[ \text{MaskedAttention}(Q, K, V) = \text{softmax}\left(\frac{Q K^T}{\sqrt{d_k}} + M\right) V \]
where \( M \) has 0 for allowed positions and \( -\infty \) for blocked positions.
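Building this additive mask is a one-liner per row. A sketch in plain Rust, using `f32::NEG_INFINITY` for the blocked positions:

fn causal_mask(seq_len: usize) -> Vec<Vec<f32>> {
    (0..seq_len)
        .map(|i| {
            // Row i: 0.0 where attention is allowed (j <= i), -inf where blocked
            (0..seq_len)
                .map(|j| if j <= i { 0.0 } else { f32::NEG_INFINITY })
                .collect()
        })
        .collect()
}

fn main() {
    for row in causal_mask(5) {
        println!("{:?}", row); // row i permits positions 0..=i
    }
}

In Exercise 3 we will build the same mask as a candle tensor and add it to the attention scores before the softmax.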
The full model architecture
Our GPT-1-style model stacks all the components in this order:
┌─────────────────────────────────────┐
│ Input Token IDs │
│ [4, 2, 7, 1] │
└──────────────┬──────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ Token Embedding (lookup) │
│ + Position Embedding │
│ → (seq_len, d_model) │
└──────────────┬──────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ Transformer Block 1 │
│ ┌─ Masked Multi-Head Attention ─┐ │
│ │ + Residual + LayerNorm │ │
│ │ FFN + Residual + LayerNorm │ │
│ └───────────────────────────────┘ │
└──────────────┬──────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ Transformer Block 2 │
│ (same structure as Block 1) │
└──────────────┬──────────────────────┘
│
▼
... (N blocks total)
│
▼
┌─────────────────────────────────────┐
│ Final Layer Norm │
└──────────────┬──────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ Linear Projection (no bias) │
│ (d_model → vocab_size) │
│ Output: logits per token │
│ → (seq_len, vocab_size) │
└─────────────────────────────────────┘
The output logits are raw (unnormalised) scores for each token in the vocabulary, at each position in the sequence. To get probabilities, we apply softmax. To get a loss, we compare these logits against the actual next tokens using cross-entropy.
Hyperparameters for our model
We keep things small enough to train on a CPU in seconds:
| Hyperparameter | Value | Description |
|---|---|---|
| vocab_size | ~65 | Number of unique characters (depends on corpus) |
| d_model | 64 | Embedding dimension |
| n_heads | 4 | Number of attention heads |
| n_layers | 2 | Number of Transformer blocks |
| d_ff | 256 | FFN inner dimension (4 × d_model) |
| max_seq_len | 128 | Maximum sequence length |
This gives roughly 100K parameters — tiny by modern standards, but sufficient to learn character-level patterns from a small corpus.
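You can check the "roughly 100K" claim with a back-of-the-envelope count. This sketch assumes the layer shapes we implement in Exercise 3 (a fused QKV projection and FFN layers with biases, two-parameter LayerNorms, and a bias-free LM head); other implementations will differ slightly:

fn main() {
    let (vocab, d, n_layers, d_ff, max_len) = (65usize, 64, 2, 256, 128);
    let embeddings = vocab * d + max_len * d;   // token + position tables
    let attn = d * 3 * d + 3 * d + d * d + d;   // fused QKV proj + output proj
    let ffn = d * d_ff + d_ff + d_ff * d + d;   // up + down projections
    let norms = 2 * 2 * d;                      // two LayerNorms (gamma, beta)
    let per_block = attn + ffn + norms;
    let head = 2 * d + d * vocab;               // final LayerNorm + LM head
    println!("≈ {} parameters", embeddings + n_layers * per_block + head);
    // Prints ≈ 116608 — "roughly 100K"
}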
How generation works
Once trained, we generate text autoregressively:
- Start with a prompt (e.g., “The “).
- Encode it to token IDs.
- Run the model to get logits for the next position.
- Sample a token from the probability distribution (or take the argmax).
- Append the sampled token to the sequence.
- Repeat from step 3.
Step 1: "The " → model → next token probabilities → sample 'c'
Step 2: "The c" → model → next token probabilities → sample 'a'
Step 3: "The ca" → model → next token probabilities → sample 't'
Step 4: "The cat" → model → next token probabilities → sample ' '
...
We only need the logits at the last position to generate the next token, but the model processes the entire sequence at once (which is efficient during training).
Key takeaway: A decoder-only language model is a stack of Transformer blocks with causal masking, sandwiched between an embedding layer and a linear output projection. The causal mask ensures each position can only attend to past tokens, enabling autoregressive generation.
§9 Exercise 3: Define the GPT-1-Style Model in candle
In this exercise, we define the full model architecture in Rust using candle. We will build the model struct by struct, from the bottom up.
The overall structure
We need these components:
- `CausalSelfAttention` — multi-head attention with causal masking
- `FeedForward` — the two-layer FFN
- `TransformerBlock` — attention + FFN with residual connections and layer norm
- `Gpt1Model` — the full model: embeddings, N blocks, final projection
Configuration
First, define a config struct in src/model.rs:
use candle_core::{DType, Result, Tensor, D};
use candle_nn::{
embedding, layer_norm, linear, linear_no_bias, Embedding, LayerNorm,
Linear, Module, VarBuilder,
};
/// Configuration for our GPT-1-style model.
#[derive(Clone)]
pub struct GptConfig {
pub vocab_size: usize,
pub d_model: usize,
pub n_heads: usize,
pub n_layers: usize,
pub d_ff: usize,
pub max_seq_len: usize,
}
impl GptConfig {
/// A tiny configuration suitable for CPU training.
pub fn tiny(vocab_size: usize) -> Self {
GptConfig {
vocab_size,
d_model: 64,
n_heads: 4,
n_layers: 2,
d_ff: 256,
max_seq_len: 128,
}
}
}
Causal self-attention
/// Multi-head causal self-attention.
pub struct CausalSelfAttention {
qkv_proj: Linear,
out_proj: Linear,
n_heads: usize,
d_k: usize,
}
impl CausalSelfAttention {
pub fn new(cfg: &GptConfig, vb: VarBuilder) -> Result<Self> {
let d_k = cfg.d_model / cfg.n_heads;
// Project Q, K, V in a single linear layer for efficiency.
let qkv_proj = linear(cfg.d_model, 3 * cfg.d_model, vb.pp("qkv_proj"))?;
let out_proj = linear(cfg.d_model, cfg.d_model, vb.pp("out_proj"))?;
Ok(Self { qkv_proj, out_proj, n_heads: cfg.n_heads, d_k })
}
pub fn forward(&self, x: &Tensor) -> Result<Tensor> {
let (seq_len, d_model) = (x.dim(0)?, x.dim(1)?);
// Project to Q, K, V in one operation, then split
let qkv = self.qkv_proj.forward(x)?; // (seq_len, 3 * d_model)
        let q = qkv.narrow(1, 0, d_model)?;
        let k = qkv.narrow(1, d_model, d_model)?;
        let v = qkv.narrow(1, 2 * d_model, d_model)?;
        // Reshape for multi-head: (seq_len, n_heads, d_k), then transpose to
        // (n_heads, seq_len, d_k) for batched attention. The `contiguous`
        // calls materialise the strided views created by `narrow` and
        // `transpose` so the reshapes and matmuls below are valid.
        let q = q.contiguous()?
            .reshape((seq_len, self.n_heads, self.d_k))?
            .transpose(0, 1)?
            .contiguous()?;
        let k = k.contiguous()?
            .reshape((seq_len, self.n_heads, self.d_k))?
            .transpose(0, 1)?
            .contiguous()?;
        let v = v.contiguous()?
            .reshape((seq_len, self.n_heads, self.d_k))?
            .transpose(0, 1)?
            .contiguous()?;
// Scaled dot-product attention: (n_heads, seq_len, seq_len)
let scale = (self.d_k as f64).sqrt();
let scores = q.matmul(&k.transpose(1, 2)?)?.affine(1.0 / scale, 0.0)?;
        // Causal mask: an additive (seq_len, seq_len) matrix with 0 at
        // allowed positions (on and below the diagonal) and a large negative
        // value above it, so softmax gives future positions ~zero weight.
        let tril = Tensor::tril2(seq_len, DType::F32, x.device())?; // lower triangle = 1
        let additive_mask = tril.affine(1e9, -1e9)?; // 1 → 0, 0 → -1e9
        let masked_scores = scores.broadcast_add(&additive_mask)?;
let weights = candle_nn::ops::softmax(&masked_scores, D::Minus1)?;
// Weighted sum of values
let attn_out = weights.matmul(&v)?; // (n_heads, seq_len, d_k)
// Reshape back: transpose → (seq_len, n_heads, d_k) → (seq_len, d_model)
        let attn_out = attn_out.transpose(0, 1)?
            .contiguous()?
            .reshape((seq_len, d_model))?;
// Output projection
self.out_proj.forward(&attn_out)
}
}
Feed-forward network
/// Position-wise feed-forward network with GELU activation.
pub struct FeedForward {
up: Linear,
down: Linear,
}
impl FeedForward {
pub fn new(cfg: &GptConfig, vb: VarBuilder) -> Result<Self> {
let up = linear(cfg.d_model, cfg.d_ff, vb.pp("up"))?;
let down = linear(cfg.d_ff, cfg.d_model, vb.pp("down"))?;
Ok(Self { up, down })
}
pub fn forward(&self, x: &Tensor) -> Result<Tensor> {
let h = self.up.forward(x)?.gelu()?;
self.down.forward(&h)
}
}
Transformer block (pre-norm)
/// A single Transformer block with pre-norm residual connections.
pub struct TransformerBlock {
attn: CausalSelfAttention,
ffn: FeedForward,
ln1: LayerNorm,
ln2: LayerNorm,
}
impl TransformerBlock {
pub fn new(cfg: &GptConfig, vb: VarBuilder) -> Result<Self> {
let attn = CausalSelfAttention::new(cfg, vb.pp("attn"))?;
let ffn = FeedForward::new(cfg, vb.pp("ffn"))?;
let ln1 = layer_norm(cfg.d_model, Default::default(), vb.pp("ln1"))?;
let ln2 = layer_norm(cfg.d_model, Default::default(), vb.pp("ln2"))?;
Ok(Self { attn, ffn, ln1, ln2 })
}
pub fn forward(&self, x: &Tensor) -> Result<Tensor> {
// Pre-norm: x + Attn(LayerNorm(x))
let residual = x;
let h = self.ln1.forward(x)?;
let h = self.attn.forward(&h)?;
let x = (residual + h)?;
// Pre-norm: x + FFN(LayerNorm(x))
let residual = &x;
let h = self.ln2.forward(&x)?;
let h = self.ffn.forward(&h)?;
(residual + h)
}
}
The full GPT model
/// A small GPT-1-style language model.
pub struct Gpt1Model {
token_emb: Embedding,
pos_emb: Embedding,
blocks: Vec<TransformerBlock>,
ln_f: LayerNorm,
lm_head: Linear,
}
impl Gpt1Model {
pub fn new(cfg: &GptConfig, vb: VarBuilder) -> Result<Self> {
let token_emb = embedding(cfg.vocab_size, cfg.d_model, vb.pp("token_emb"))?;
let pos_emb = embedding(cfg.max_seq_len, cfg.d_model, vb.pp("pos_emb"))?;
let mut blocks = Vec::with_capacity(cfg.n_layers);
for i in 0..cfg.n_layers {
blocks.push(TransformerBlock::new(cfg, vb.pp(format!("block_{}", i)))?);
}
let ln_f = layer_norm(cfg.d_model, Default::default(), vb.pp("ln_f"))?;
let lm_head = linear_no_bias(cfg.d_model, cfg.vocab_size, vb.pp("lm_head"))?;
Ok(Self { token_emb, pos_emb, blocks, ln_f, lm_head })
}
/// Forward pass: token IDs → logits.
///
/// # Arguments
/// * `token_ids` - 1D tensor of shape `(seq_len,)` with token IDs.
///
/// # Returns
/// Logits tensor of shape `(seq_len, vocab_size)`.
pub fn forward(&self, token_ids: &Tensor) -> Result<Tensor> {
let seq_len = token_ids.dim(0)?;
// Create position indices [0, 1, 2, ..., seq_len-1]
let positions = Tensor::arange(0u32, seq_len as u32, token_ids.device())?;
// Embed tokens and positions, then add
let tok_emb = self.token_emb.forward(token_ids)?;
let pos_emb = self.pos_emb.forward(&positions)?;
let mut x = (tok_emb + pos_emb)?;
// Pass through all Transformer blocks
for block in &self.blocks {
x = block.forward(&x)?;
}
// Final layer norm + projection to vocabulary
let x = self.ln_f.forward(&x)?;
self.lm_head.forward(&x)
}
}
Testing the model
In main.rs:
mod model;
use candle_core::{DType, Device, Tensor};
use candle_nn::VarMap;
fn main() -> anyhow::Result<()> {
let device = &Device::Cpu;
let varmap = VarMap::new();
let vb = candle_nn::VarBuilder::from_varmap(&varmap, DType::F32, device);
let cfg = model::GptConfig::tiny(65); // 65 characters in typical Shakespeare
let model = model::Gpt1Model::new(&cfg, vb)?;
// Create a dummy input: 10 token IDs
let input = Tensor::new(&[0u32, 1, 2, 3, 4, 5, 6, 7, 8, 9], device)?;
let logits = model.forward(&input)?;
println!("Input shape: {:?}", input.shape());
println!("Output logits shape: {:?}", logits.shape());
// Should be (10, 65) — 10 positions, 65 vocabulary scores each
Ok(())
}
What to observe
- The output shape should be `(seq_len, vocab_size)` — one set of logits per input position.
- With random weights, the logits will be meaningless noise. Training (next section) will make them meaningful.
- The model processes the entire sequence in parallel — this is the advantage of Transformers over RNNs.
Key takeaway: Our GPT model is built from composable structs: `CausalSelfAttention`, `FeedForward`, `TransformerBlock`, and `Gpt1Model`. Each handles one concern, and the `candle` crate provides the tensor operations and automatic differentiation we need for training.
Part 4 — Training
§10 Cross-Entropy Loss and the Training Loop
We have a model that takes token IDs and outputs logits. Now we need to teach it to output the right logits — the ones that predict the next token accurately. This is where training comes in.
The training objective
Recall that our model outputs logits of shape (seq_len, vocab_size) — a score for every token in the vocabulary, at every position. The training target is simple: at each position i, the correct next token is position i + 1 in the input.
Input: [T, h, e, , c, a, t]
Target: [h, e, , c, a, t, .]
Position: 0 1 2 3 4 5 6
At position 0, where the input is “T”, the model should predict “h”. At position 1, it should predict “e”, and so on. We shift the input by one to create the targets.
Cross-entropy loss
Cross-entropy loss measures how far the model’s predicted probability distribution is from the true distribution (where all probability mass is on the correct token).
For a single position where the correct token has index \( y \):
\[ \mathcal{L} = -\log P(y) = -\log \frac{e^{z_y}}{\sum_j e^{z_j}} \]
Where \( z_j \) are the logits. Intuitively:
- If the model assigns high probability to the correct token, \( -\log P(y) \) is small (close to 0). Good.
- If the model assigns low probability, \( -\log P(y) \) is large. Bad.
We average this loss over all positions in the sequence and all sequences in the batch.
A concrete example
Suppose our vocabulary is ['a', 'b', 'c'] and the correct next token is 'b' (index 1). The model outputs logits:
Logits: [2.0, 5.0, 1.0]
After softmax: [0.05, 0.93, 0.02] (e^2 / sum, e^5 / sum, e^1 / sum)
Loss: -log(0.93) = 0.07 (low loss — model is confident and correct)
If the model were wrong:
Logits: [5.0, 1.0, 2.0]
After softmax: [0.93, 0.02, 0.05]
Loss: -log(0.02) = 3.91 (high loss — model is confident but wrong)
Gradient descent
To minimise the loss, we use gradient descent. The idea is:
- Compute the loss for a batch of data.
- Compute the gradient of the loss with respect to every model parameter (backpropagation).
- Update each parameter by subtracting a small multiple of its gradient: \[ \theta \leftarrow \theta - \eta \nabla_\theta \mathcal{L} \] where \( \eta \) is the learning rate.
The learning rate is a critical hyperparameter:
- Too high: training is unstable, loss oscillates or diverges.
- Too low: training is painfully slow.
- A typical starting value for small models: `1e-3` to `3e-4`.
The training loop
The training loop repeats the following steps for many epochs (passes through the entire dataset):
for epoch in 1..=num_epochs:
for batch in dataset:
1. Forward pass: logits = model(input_tokens)
2. Compute loss: loss = cross_entropy(logits, target_tokens)
3. Backward pass: compute gradients via backpropagation
4. Update weights: optimizer.step()
5. Zero gradients: optimizer.zero_grad()
print epoch loss
Batching
For efficiency, we process multiple sequences at once in a batch. Instead of feeding one sequence at a time, we stack batch_size sequences into a matrix:
- Input shape: `(batch_size, seq_len)`
- Output logits: `(batch_size, seq_len, vocab_size)`
For our small model training on CPU, a batch size of 32-64 works well.
The AdamW optimiser
We will use AdamW — a variant of the Adam optimiser with decoupled weight decay. Adam adapts the learning rate for each parameter based on the history of its gradients, which generally works much better than plain gradient descent. candle provides AdamW out of the box.
AdamW hyperparameters:
learning_rate: 3e-4
beta1: 0.9 (momentum)
beta2: 0.999 (RMS of gradients)
weight_decay: 0.1
Key takeaway: Cross-entropy loss measures how well the model’s predictions match the true next tokens. The training loop repeatedly computes this loss, computes gradients via backpropagation, and updates the model’s parameters to reduce the loss.
§11 Exercise 4: Train on a Small Text Corpus
In this exercise, we put everything together: load a text corpus, create training data, and train our model.
Preparing the data
For training data, we will use a small text corpus — a few kilobytes of Shakespeare works well. Create a file data/input.txt with some text, or use this approach to embed the data directly:
/// Load and prepare training data.
/// Returns (tokeniser, input_ids) where input_ids is the entire
/// corpus encoded as a vector of token IDs.
fn load_data() -> (CharTokeniser, Vec<u32>) {
let text = "\
First Citizen:
Before we proceed any further, hear me speak.
All:
Speak, speak.
First Citizen:
You are all resolved rather to die than to famish?
All:
Resolved. resolved.
First Citizen:
First, you know Caius Marcius is chief enemy to the people.
All:
We know't, we know't.
First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?
All:
No more talking on't; let it be done: away, away!
";
let tok = CharTokeniser::from_corpus(text);
let ids = tok.encode(text);
(tok, ids)
}
You can replace this with a longer text for better results. The more data, the more patterns the model can learn.
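If you would rather train on the data/input.txt file suggested above, a small variant of `load_data` does the trick:

use std::fs;

/// Load a corpus from disk instead of the embedded string.
fn load_data_from_file(path: &str) -> std::io::Result<(CharTokeniser, Vec<u32>)> {
    let text = fs::read_to_string(path)?; // e.g. "data/input.txt"
    let tok = CharTokeniser::from_corpus(&text);
    let ids = tok.encode(&text);
    Ok((tok, ids))
}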
Creating batches
We need to extract fixed-length chunks from the corpus for training:
use candle_core::{DType, Device, Tensor};
/// Create a batch of (input, target) pairs from the corpus.
///
/// Each input is a sequence of `seq_len` tokens.
/// Each target is the same sequence shifted by one position.
fn create_batch(
data: &[u32],
batch_size: usize,
seq_len: usize,
device: &Device,
) -> anyhow::Result<(Tensor, Tensor)> {
use rand::Rng;
let mut rng = rand::thread_rng();
let max_start = data.len() - seq_len - 1;
let mut inputs = Vec::with_capacity(batch_size * seq_len);
let mut targets = Vec::with_capacity(batch_size * seq_len);
for _ in 0..batch_size {
let start = rng.gen_range(0..max_start);
for j in 0..seq_len {
inputs.push(data[start + j]);
targets.push(data[start + j + 1]);
}
}
let inputs = Tensor::new(inputs.as_slice(), device)?
.reshape((batch_size, seq_len))?;
let targets = Tensor::new(targets.as_slice(), device)?
.reshape((batch_size, seq_len))?;
Ok((inputs, targets))
}
The training loop
Here is the complete training loop. Note that our model’s forward pass needs to be adjusted to handle batched input (a 2D tensor instead of 1D). For simplicity, we can process each sequence in the batch separately and stack the results:
use candle_nn::{AdamW, Optimizer, ParamsAdamW, VarMap, VarBuilder};
fn train() -> anyhow::Result<()> {
let device = &Device::Cpu;
let (tok, data) = load_data();
println!("Corpus size: {} characters", data.len());
println!("Vocabulary size: {}", tok.vocab_size());
// Model setup
let varmap = VarMap::new();
let vb = VarBuilder::from_varmap(&varmap, DType::F32, device);
let cfg = GptConfig::tiny(tok.vocab_size());
let model = Gpt1Model::new(&cfg, vb)?;
// Optimiser
let params = ParamsAdamW {
lr: 3e-4,
weight_decay: 0.1,
..Default::default()
};
let mut opt = AdamW::new(varmap.all_vars(), params)?;
// Training hyperparameters
let batch_size = 16;
let seq_len = 64;
let num_steps = 1000;
println!("\nTraining for {} steps...\n", num_steps);
for step in 1..=num_steps {
let (inputs, targets) = create_batch(&data, batch_size, seq_len, device)?;
// Forward pass: process each sequence in the batch
let mut all_logits = Vec::new();
for b in 0..batch_size {
let input_b = inputs.get(b)?; // (seq_len,)
let logits_b = model.forward(&input_b)?; // (seq_len, vocab_size)
all_logits.push(logits_b);
}
let logits = Tensor::stack(&all_logits, 0)?; // (batch, seq_len, vocab_size)
// Reshape for cross-entropy: flatten batch and sequence dimensions
let vocab_size = tok.vocab_size();
let logits_flat = logits.reshape((batch_size * seq_len, vocab_size))?;
let targets_flat = targets.reshape(batch_size * seq_len)?;
// Cross-entropy loss
        // Cross-entropy loss. candle's `cross_entropy` applies log-softmax
        // internally and expects integer class indices as the target.
        let targets_flat = targets_flat.to_dtype(DType::I64)?;
        let loss = candle_nn::loss::cross_entropy(&logits_flat, &targets_flat)?;
// Backward pass + optimiser step
opt.backward_step(&loss)?;
if step % 100 == 0 || step == 1 {
let loss_val: f32 = loss.to_scalar()?;
println!("Step {:>4} | Loss: {:.4}", step, loss_val);
}
}
println!("\nTraining complete!");
Ok(())
}
Add rand to your dependencies
Update Cargo.toml:
[dependencies]
candle-core = "0.8"
candle-nn = "0.8"
anyhow = "1"
rand = "0.8"
Expected output
Corpus size: 482 characters
Vocabulary size: 42
Training for 1000 steps...
Step 1 | Loss: 3.7376
Step 100 | Loss: 2.8412
Step 200 | Loss: 2.3567
Step 300 | Loss: 2.0134
Step 400 | Loss: 1.8223
Step 500 | Loss: 1.6891
Step 600 | Loss: 1.5744
Step 700 | Loss: 1.4832
Step 800 | Loss: 1.4102
Step 900 | Loss: 1.3523
Step 1000 | Loss: 1.2987
Training complete!
The loss should decrease steadily. A random model starts with loss \( \approx \ln(\text{vocab_size}) \) (for 42 tokens, that is \( \ln(42) \approx 3.74 \)). As the model trains, it learns character patterns and the loss drops.
Tips for better results
- Use more data. Even a few pages of Shakespeare (50KB+) will dramatically improve generation quality.
- Train longer. 1000 steps is a minimum — try 5000 or 10000 for better results.
- Adjust the learning rate. If loss plateaus, try reducing the learning rate.
- Increase model size. With more data, you can increase `d_model` to 128 and `n_layers` to 4.
Key takeaway: The training loop repeatedly samples batches, computes the forward pass and cross-entropy loss, and updates weights via backpropagation. Watching the loss decrease is satisfying confirmation that the model is learning.
§12 Exercise 5: Sample from the Model
The payoff for all our work — generating text from the trained model. In this exercise, we implement temperature-based sampling and generate text character by character.
Temperature sampling
After the model produces logits for the next token, we convert them to probabilities using softmax. The temperature parameter controls the randomness of sampling:
\[ P(t_i) = \frac{e^{z_i / T}}{\sum_j e^{z_j / T}} \]
Where \( T \) is the temperature:
| Temperature | Effect |
|---|---|
| T < 1.0 | Sharper distribution — model picks high-probability tokens more often. More deterministic, less creative. |
| T = 1.0 | Unmodified distribution — sample directly from learned probabilities. |
| T > 1.0 | Flatter distribution — lower-probability tokens get a bigger share. More random, more creative. |
| T → 0 | Equivalent to argmax — always pick the most likely token. |
Logits: [2.0, 5.0, 1.0]
T = 1.0 → P: [0.05, 0.93, 0.02] (normal)
T = 0.5 → P: [0.00, 1.00, 0.00] (very peaked)
T = 2.0 → P: [0.16, 0.74, 0.10] (flattened)
Top-k sampling
Top-k sampling restricts the choice to the k most probable tokens, setting all other probabilities to zero. This prevents the model from choosing extremely unlikely tokens (which can produce gibberish):
Logits (sorted): [5.0, 3.0, 2.0, 0.5, -1.0, -3.0]
Top-k (k=3): [5.0, 3.0, 2.0, -inf, -inf, -inf]
After softmax: [0.84, 0.11, 0.04, 0.0, 0.0, 0.0]
Combining temperature and top-k is the standard approach in practice.
Implementation
Add a sampling function to your project:
use rand::distributions::Distribution;
/// Sample a token ID from logits with temperature and optional top-k.
///
/// # Arguments
/// * `logits` - 1D tensor of shape `(vocab_size,)` — raw model output
/// * `temperature` - Controls randomness (lower = more deterministic)
/// * `top_k` - If Some(k), only consider the top k most likely tokens
fn sample_token(
logits: &Tensor,
temperature: f64,
top_k: Option<usize>,
) -> anyhow::Result<u32> {
    let vocab_size = logits.dim(0)?;
// Apply temperature
let scaled = if temperature != 1.0 {
(logits / temperature)?
} else {
logits.clone()
};
// Convert to Vec for manipulation
let mut logit_vec: Vec<f32> = scaled.to_vec1()?;
// Apply top-k: set everything outside top-k to -inf
if let Some(k) = top_k {
let k = k.min(vocab_size);
let mut indexed: Vec<(usize, f32)> = logit_vec.iter()
.enumerate()
.map(|(i, &v)| (i, v))
.collect();
indexed.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
let threshold = indexed[k - 1].1;
for val in logit_vec.iter_mut() {
if *val < threshold {
*val = f32::NEG_INFINITY;
}
}
}
// Softmax to get probabilities
let max_val = logit_vec.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
let exps: Vec<f32> = logit_vec.iter().map(|&x| (x - max_val).exp()).collect();
let sum: f32 = exps.iter().sum();
let probs: Vec<f64> = exps.iter().map(|&x| (x / sum) as f64).collect();
// Sample from the distribution
let dist = rand::distributions::WeightedIndex::new(&probs)?;
let mut rng = rand::thread_rng();
Ok(dist.sample(&mut rng) as u32)
}
The generation loop
/// Generate text from the model autoregressively.
///
/// # Arguments
/// * `model` - The trained GPT model
/// * `tok` - The tokeniser
/// * `prompt` - Starting text
/// * `max_tokens` - Number of tokens to generate
/// * `temperature` - Sampling temperature
/// * `top_k` - Optional top-k filtering
fn generate(
model: &Gpt1Model,
tok: &CharTokeniser,
prompt: &str,
max_tokens: usize,
temperature: f64,
top_k: Option<usize>,
) -> anyhow::Result<String> {
let device = &Device::Cpu;
// Encode the prompt
let mut token_ids = tok.encode(prompt);
let max_seq_len = 128; // Must match model config
for _ in 0..max_tokens {
// Truncate to max_seq_len if needed (keep most recent tokens)
let start = if token_ids.len() > max_seq_len {
token_ids.len() - max_seq_len
} else {
0
};
let context = &token_ids[start..];
// Forward pass
let input = Tensor::new(context, device)?;
let logits = model.forward(&input)?;
// Get logits for the last position
let last_logits = logits.get(context.len() - 1)?; // (vocab_size,)
// Sample next token
let next_token = sample_token(&last_logits, temperature, top_k)?;
token_ids.push(next_token);
}
Ok(tok.decode(&token_ids))
}
Putting it together
After training, add generation:
fn main() -> anyhow::Result<()> {
// ... training code from Exercise 4 ...
println!("\n--- Generating text ---\n");
// Try different temperatures
for temp in [0.5, 0.8, 1.0, 1.5] {
let text = generate(&model, &tok, "First", 200, temp, Some(10))?;
println!("Temperature {:.1}:", temp);
println!("{}\n", text);
}
Ok(())
}
Expected output
With a well-trained model on Shakespeare text:
Temperature 0.5:
First Citizen:
Before we proceed any further, hear me speak.
All:
Speak, speak.
Temperature 1.0:
First Citizen:
Let us know the people are resolved to
the corn at our own kill him, and we
speak.
Temperature 1.5:
First Civkzl:
aNo moye, ws arl't; he proceiw
Le usdn ferktie corn at mork
At low temperature, the model reproduces memorised text. At high temperature, it becomes creative but error-prone. Temperature 0.8-1.0 is usually the sweet spot.
Exercises to try
- Experiment with top-k. Try `k = 1` (greedy), `k = 5`, `k = 10`, and `None` (no filtering). How does it affect output quality?
- Implement top-p (nucleus) sampling. Instead of a fixed k, include tokens until their cumulative probability exceeds a threshold p (e.g., 0.9).
- Try different prompts. How does the model respond to prompts it has seen vs. novel prompts?
- Measure perplexity. Compute \( e^{\text{average loss}} \) on a held-out test set to quantify model quality.
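For exercise 2, here is a minimal sketch of nucleus filtering. It operates in place on an `f32` probability vector, so it would slot into `sample_token` after the softmax and before the `f64` conversion for `WeightedIndex`; the name `top_p_filter` and the slice interface are illustrative choices, not code from earlier in the chapter.
/// Nucleus (top-p) filtering sketch: keep the smallest set of tokens whose
/// cumulative probability reaches `p`, zero out the rest, and renormalise.
/// Assumes `probs` already sums to 1 (i.e., run this after the softmax).
fn top_p_filter(probs: &mut [f32], p: f32) {
    // Sort token indices by probability, highest first
    let mut indexed: Vec<(usize, f32)> = probs.iter().cloned().enumerate().collect();
    indexed.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    // Walk down the sorted list until the cumulative mass reaches p
    let mut keep = vec![false; probs.len()];
    let mut cumulative = 0.0;
    for (i, prob) in indexed {
        keep[i] = true;
        cumulative += prob;
        if cumulative >= p {
            break; // the nucleus is complete
        }
    }
    // Zero out everything outside the nucleus, then renormalise
    for (i, prob) in probs.iter_mut().enumerate() {
        if !keep[i] {
            *prob = 0.0;
        }
    }
    let sum: f32 = probs.iter().sum();
    for prob in probs.iter_mut() {
        *prob /= sum;
    }
}
After filtering, sampling proceeds exactly as before with `WeightedIndex`.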
Key takeaway: Text generation works by repeatedly running the model, sampling a token from the output distribution, and appending it. Temperature and top-k control the tradeoff between coherence and creativity.
Part 5 — Reflection
§13 What Limits This Model?
We have built a working language model — it can learn patterns in text and generate new text. But if you compare its output to ChatGPT or Claude, the gap is enormous. Let us understand why.
Context length
Our model has a maximum context window of 128 tokens (characters). It literally cannot “see” anything beyond the last 128 characters. Modern LLMs have context windows of 8K to 200K+ tokens (and those are subword tokens, each representing several characters). This means our model:
- Cannot maintain coherence over long passages.
- Cannot reason about information that appeared more than a sentence or two ago (128 characters is only a couple of dozen words).
- Has no ability to follow instructions that exceed its window.
Model size
Our model has roughly 100,000 parameters. For comparison:
| Model | Parameters | Ratio to ours |
|---|---|---|
| Our model | ~100K | 1x |
| GPT-1 (2018) | 117M | 1,170x |
| GPT-2 (2019) | 1.5B | 15,000x |
| GPT-3 (2020) | 175B | 1,750,000x |
| GPT-4 (2023) | ~1.8T (rumoured) | ~18,000,000x |
With only 100K parameters, our model can memorise short character patterns but cannot learn grammar, semantics, or world knowledge. Larger models have more capacity to store and compose information.
Training data
We trained on a few hundred characters. Real LLMs train on trillions of tokens — essentially the entire public internet, books, code repositories, and more. The sheer volume and diversity of data is what allows large models to:
- Learn the structure of many languages.
- Absorb factual knowledge.
- Understand code, math, and reasoning patterns.
Tokenisation
Character-level tokenisation means our model sees one character per step. A 10-word sentence is ~50 tokens for us, but only ~10-15 tokens for a BPE tokeniser. This means:
- Our model needs a longer context window for the same effective range.
- Longer sequences are more expensive to process (attention is \( O(n^2) \) in sequence length).
- The model must learn to spell — it cannot take words as atomic units.
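To put rough numbers on the cost difference: suppose a BPE tokeniser covers our 128-character window in about 32 subword tokens (an illustrative figure). Because attention cost grows quadratically, that is a \( (128/32)^2 = 16\times \) reduction in attention computation for the same effective context.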
Real LLMs use BPE (GPT) or SentencePiece (Llama) tokenisers with vocabularies of 32K-100K tokens.
Training techniques we skipped
Production LLMs use many techniques we did not cover:
- Learning rate scheduling. A warm-up phase followed by cosine decay.
- Gradient clipping. Preventing exploding gradients by capping their magnitude. (Both of these are sketched after this list.)
- Mixed precision training. Using float16/bfloat16 for speed and memory efficiency.
- Data parallelism and model parallelism. Distributing training across hundreds of GPUs.
- RLHF (Reinforcement Learning from Human Feedback). Fine-tuning the model to follow instructions and be helpful, using human preference data. This is what makes ChatGPT and Claude conversational, rather than just completing text.
- Supervised fine-tuning (SFT). Training on curated instruction-response pairs before RLHF.
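As a concrete illustration of the first two items, here is a small framework-agnostic sketch. The names `lr_at_step` and `clip_grad_norm`, and the plain-slice gradient interface, are illustrative assumptions rather than candle API:
/// Warm-up + cosine decay schedule sketch. `step` counts optimiser
/// updates; all names and constants here are illustrative.
fn lr_at_step(step: usize, warmup_steps: usize, total_steps: usize, peak_lr: f64) -> f64 {
    if step < warmup_steps {
        // Linear warm-up from 0 to peak_lr
        peak_lr * (step as f64 / warmup_steps as f64)
    } else {
        // Cosine decay from peak_lr down to 0
        let progress = ((step - warmup_steps) as f64
            / (total_steps - warmup_steps) as f64)
            .min(1.0);
        peak_lr * 0.5 * (1.0 + (std::f64::consts::PI * progress).cos())
    }
}

/// Gradient clipping sketch: rescale the whole gradient vector if its
/// L2 norm exceeds `max_norm`, preserving its direction.
fn clip_grad_norm(grads: &mut [f32], max_norm: f32) {
    let norm = grads.iter().map(|g| g * g).sum::<f32>().sqrt();
    if norm > max_norm {
        let scale = max_norm / norm;
        for g in grads.iter_mut() {
            *g *= scale;
        }
    }
}
In practice the schedule value is fed to the optimiser at every step, and clipping runs on the gradients just before each parameter update.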
What our model CAN do
Despite its limitations, our model demonstrates every fundamental component of a modern LLM:
- Tokenisation — converting text to numbers and back.
- Embeddings — learned vector representations of tokens and positions.
- Self-attention with causal masking — the core Transformer mechanism.
- Stacked Transformer blocks — depth through repeated application.
- Cross-entropy training — learning from next-token prediction.
- Autoregressive generation — producing text one token at a time.
The jump from our model to GPT-4 is primarily one of scale — more parameters, more data, more compute — plus careful engineering and alignment techniques. The architecture is fundamentally the same.
Key takeaway: Our model is limited by context length, model size, training data, and tokenisation. But it contains every core component of production LLMs. The path from here to GPT-4 is primarily scaling and engineering, not architectural revolution.
§14 Further Reading
This chapter covered the foundations. Here are resources to go deeper, organised by topic.
Foundational papers
- “Attention Is All You Need” (Vaswani et al., 2017) — arxiv.org/abs/1706.03762 — The paper that introduced the Transformer architecture. Essential reading.
- “Improving Language Understanding by Generative Pre-Training” (Radford et al., 2018) — CDN link — The GPT-1 paper. Showed that a decoder-only Transformer pre-trained on language modeling can be fine-tuned for many tasks.
- “Language Models are Unsupervised Multitask Learners” (Radford et al., 2019) — CDN link — The GPT-2 paper. Showed that scaling up GPT-1 leads to emergent few-shot abilities.
- “Language Models are Few-Shot Learners” (Brown et al., 2020) — arxiv.org/abs/2005.14165 — The GPT-3 paper. Demonstrated that massive scale enables in-context learning.
Tutorials and courses
- Andrej Karpathy’s “Let’s build GPT” — youtube.com/watch?v=kCc8FmEb1nY — A two-hour video building a GPT from scratch in Python/PyTorch. Excellent companion to this chapter, covering the same ideas in a different language.
- Andrej Karpathy’s “makemore” series — Builds character-level language models of increasing complexity, from bigrams to Transformers. Available on YouTube.
- 3Blue1Brown “But what is a neural network?” — youtube.com/watch?v=aircAruvnKk — Beautiful visual explanations of the basics of neural networks, backpropagation, and gradient descent.
- “The Illustrated Transformer” by Jay Alammar — jalammar.github.io/illustrated-transformer/ — The best visual guide to the Transformer architecture.
Rust ML ecosystem
- Candle — github.com/huggingface/candle — The tensor framework we used in this chapter. Supports CPU and GPU, with a PyTorch-like API.
- Candle documentation — docs.rs/candle-core — API reference for tensor operations.
- Burn — burn.dev — Another Rust deep learning framework, with a different design philosophy (backend-agnostic).
- tch-rs — github.com/LaurentMazare/tch-rs — Rust bindings for PyTorch’s C++ library (libtorch). More mature but requires a C++ dependency.
Books
- “Deep Learning” by Goodfellow, Bengio, and Courville — deeplearningbook.org — The comprehensive textbook on deep learning fundamentals.
- “Dive into Deep Learning” — d2l.ai — Interactive, code-first textbook with implementations in multiple frameworks.
- “Speech and Language Processing” by Jurafsky and Martin — web.stanford.edu/~jurafsky/slp3/ — Covers NLP foundations including language modeling, with chapters on neural approaches.
Topics to explore next
Now that you understand the basics, here are natural next steps:
- Subword tokenisation. Implement BPE (Byte-Pair Encoding) to handle larger vocabularies efficiently. See the `tokenizers` crate by Hugging Face.
- GPU training. Switch from `Device::Cpu` to `Device::Cuda` in candle to train on a GPU. This enables much larger models and datasets.
- Positional encodings. Experiment with RoPE (Rotary Position Embeddings), which is used in Llama and most modern models.
- KV caching. During generation, cache the key and value tensors from previous tokens to avoid redundant computation. This is essential for fast inference; a minimal sketch follows this list.
- Fine-tuning a pre-trained model. Load a pre-trained model (e.g., a small Llama) in candle and fine-tune it on your own data.
- RLHF. Study how reinforcement learning from human feedback transforms a language model into an assistant.
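To make the KV-caching idea concrete, here is a minimal sketch built on candle's `Tensor::cat`. The `KvCache` struct, its shapes, and the `append` method are illustrative; a real implementation lives inside each attention layer and also handles batching and the context-window limit.
use candle_core::{Result, Tensor};

/// Minimal per-layer KV cache sketch: grow the cached keys/values by one
/// step at a time so earlier tokens are never re-projected.
struct KvCache {
    k: Option<Tensor>, // (tokens_so_far, head_dim)
    v: Option<Tensor>,
}

impl KvCache {
    fn new() -> Self {
        Self { k: None, v: None }
    }

    /// Append this step's key/value rows (shape (1, head_dim)) and return
    /// the full cached tensors to attend over.
    fn append(&mut self, k_new: &Tensor, v_new: &Tensor) -> Result<(Tensor, Tensor)> {
        let k = match &self.k {
            Some(k) => Tensor::cat(&[k, k_new], 0)?,
            None => k_new.clone(),
        };
        let v = match &self.v {
            Some(v) => Tensor::cat(&[v, v_new], 0)?,
            None => v_new.clone(),
        };
        self.k = Some(k.clone());
        self.v = Some(v.clone());
        Ok((k, v))
    }
}
With a cache in place, each generation step feeds only the newest token through the model and attends over the cached keys and values, instead of re-processing the whole window.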
Key takeaway: The field of language modeling is vast and evolving rapidly. The fundamentals you learned in this chapter — tokenisation, embeddings, attention, training — are the foundation everything else builds on. Pick a direction that interests you and keep building.