Shader Programming with wgpu and WGSL
This document is a self-guided course on GPU shader programming. It is organised into six parts: the GPU execution model, setting up with wgpu, vertex and fragment shaders, textures and samplers, compute shaders, and a look at where to go next. Each section is either a reading lesson or a hands-on Rust programming exercise.
Table of Contents
Part 1 — The GPU and the Graphics Pipeline
- CPU vs GPU: parallel execution model
- The programmable pipeline: vertex, fragment, compute shaders
- What is WGSL? Syntax overview
Part 2 — Setting Up with wgpu
- What is wgpu? Cross-platform graphics API in Rust
- Exercise 1: create a window and clear it to a colour
- The render loop: swap chains, frames, command encoders
Part 3 — Vertex and Fragment Shaders
- Vertices, buffers, and the vertex shader
- Interpolation and the fragment shader
- Exercise 2: draw a coloured triangle
- Exercise 3: animate the triangle using a time uniform
Part 4 — Textures and Samplers
- Texture coordinates (UVs), texture creation, sampler config
- Exercise 4: render a textured quad
Part 5 — Compute Shaders
- Compute pipelines: dispatching work groups
- Storage buffers and read/write access from WGSL
- Exercise 5: GPU-accelerate a particle simulation
Part 6 — Going Further
- Post-processing effects (bloom, blur): conceptual overview
- Signed Distance Fields for font rendering
- Resources: Learn WGPU, Shadertoy, The Book of Shaders
Part 1 — The GPU and the Graphics Pipeline
1. CPU vs GPU: parallel execution model
To understand shader programming, you first need to understand why GPUs exist and how they differ from CPUs. The core difference comes down to a design trade-off: latency vs throughput.
The CPU: a few powerful cores
A modern CPU has a small number of cores — typically 4 to 16 on a consumer chip. Each core is highly sophisticated: it has deep pipelines, branch predictors, out-of-order execution, and large caches. This design makes each individual core extremely fast at executing a single sequence of instructions.
CPU (8 cores)
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ Core 0 │ │ Core 1 │ │ Core 2 │ │ Core 3 │
│ (complex)│ │ (complex)│ │ (complex)│ │ (complex)│
│ OoO exec │ │ OoO exec │ │ OoO exec │ │ OoO exec │
│ Branch │ │ Branch │ │ Branch │ │ Branch │
│ pred. │ │ pred. │ │ pred. │ │ pred. │
│ L1/L2 │ │ L1/L2 │ │ L1/L2 │ │ L1/L2 │
└──────────┘ └──────────┘ └──────────┘ └──────────┘
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ Core 4 │ │ Core 5 │ │ Core 6 │ │ Core 7 │
│ (complex)│ │ (complex)│ │ (complex)│ │ (complex)│
└──────────┘ └──────────┘ └──────────┘ └──────────┘
CPUs are optimised for low latency — finishing any single task as quickly as possible. This makes them ideal for general-purpose programming: parsing JSON, running game logic, managing operating system tasks.
The GPU: thousands of simple cores
A GPU takes the opposite approach. It packs thousands of tiny, simple cores onto a single chip. Each individual core is much less powerful than a CPU core — no branch prediction, no out-of-order execution, minimal cache. But there are so many of them that the total throughput is enormous.
GPU (thousands of cores)
┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
│ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │
├───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┤
│ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │
├───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┤
│ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │
├───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┤
│ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │
├───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┤
│ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │
├───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┤
│ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │
└───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
Each · is a simple core. Thousands execute in parallel.
GPUs are optimised for high throughput — processing millions of similar operations per second. Each individual operation might be slower than on a CPU, but the sheer volume of parallel work makes up for it.
SIMD vs SIMT
You may have heard of SIMD (Single Instruction, Multiple Data) on CPUs — instructions like SSE or AVX that process 4 or 8 values at once in a single register. GPUs take this idea much further with SIMT (Single Instruction, Multiple Threads).
In SIMT, groups of threads (called warps on NVIDIA or wavefronts on AMD) execute the same instruction at the same time, but each thread operates on different data. A typical warp is 32 threads wide.
SIMT execution (one warp of 32 threads):
Instruction: multiply position by matrix
Thread 0: vertex[0].pos * matrix → result[0]
Thread 1: vertex[1].pos * matrix → result[1]
Thread 2: vertex[2].pos * matrix → result[2]
...
Thread 31: vertex[31].pos * matrix → result[31]
All 32 threads execute the same multiply instruction
at the same clock cycle, on different vertex data.
This is why GPUs are perfect for graphics: every pixel on screen needs the same computation (run the fragment shader), just with different input coordinates. The same applies to vertex transformations, physics simulations, and many other tasks.
When does the GPU win?
The GPU excels when your problem has these characteristics:
- Data parallelism: the same operation is applied to many independent data elements
- Arithmetic intensity: lots of math per memory access
- Predictable control flow: minimal branching (if/else) since all threads in a warp must take the same path
Problems that are sequential, branch-heavy, or have complex data dependencies are better left on the CPU.
Key takeaway: CPUs are fast race cars — great at finishing one task quickly. GPUs are cargo ships — slower per trip, but they move enormous amounts of freight in parallel. Shader programming is the art of loading that cargo ship efficiently.
2. The programmable pipeline: vertex, fragment, compute shaders
Modern GPUs run a programmable graphics pipeline — a fixed sequence of stages where some stages run programs you write (shaders) and others are handled automatically by the hardware. Understanding this pipeline is essential before writing any shader code.
The graphics pipeline
When you ask the GPU to draw a triangle, your data flows through several stages:
The Graphics Pipeline
=====================
CPU (your Rust code)
│
│ Vertex data + draw call
▼
┌─────────────────────┐
│ VERTEX SHADER │ ◄── Programmable (you write this)
│ Transforms each │ Runs once per vertex
│ vertex position │
└────────┬────────────┘
│
▼
┌─────────────────────┐
│ PRIMITIVE │ ◄── Fixed-function (hardware)
│ ASSEMBLY │ Connects vertices into
│ │ triangles, lines, or points
└────────┬────────────┘
│
▼
┌─────────────────────┐
│ RASTERISATION │ ◄── Fixed-function (hardware)
│ Determines which │ Converts triangles into
│ pixels a triangle │ fragments (candidate pixels)
│ covers │
└────────┬────────────┘
│
▼
┌─────────────────────┐
│ FRAGMENT SHADER │ ◄── Programmable (you write this)
│ Computes the │ Runs once per fragment
│ colour of each │ (potential pixel)
│ fragment │
└────────┬────────────┘
│
▼
┌─────────────────────┐
│ OUTPUT MERGER │ ◄── Fixed-function (hardware)
│ Depth test, blend │ Combines fragments into
│ with framebuffer │ the final image
└─────────────────────┘
The vertex shader
The vertex shader runs once for every vertex you submit. Its primary job is to transform vertex positions from model space (the coordinates you defined your mesh in) to clip space (the coordinate system the GPU uses to determine what is on screen).
A vertex shader typically receives input data — position, colour, texture coordinates — and outputs a transformed position plus any data that should be passed to the fragment shader.
For example, a vertex shader might:
- Multiply the vertex position by a model-view-projection matrix (sketched after this list)
- Pass the vertex colour through to the next stage
- Compute lighting values at each vertex
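A minimal WGSL sketch of the first case, assuming a uniform mat4x4<f32> named mvp has been bound (uniform bindings are covered in section 3 and Exercise 3):
@group(0) @binding(0)
var<uniform> mvp: mat4x4<f32>;
@vertex
fn vs_main(@location(0) position: vec3<f32>) -> @builtin(position) vec4<f32> {
    // One matrix multiply takes the vertex from model space to clip space.
    return mvp * vec4<f32>(position, 1.0);
}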
Rasterisation
After the vertex shader runs and the GPU assembles vertices into triangles, rasterisation determines which screen pixels each triangle covers. This is not programmable — the hardware handles it automatically.
For each pixel covered by a triangle, the rasteriser generates a fragment. A fragment is a candidate pixel — it carries interpolated values from the triangle’s vertices (we will explore interpolation in detail in section 8).
The fragment shader
The fragment shader runs once for every fragment produced by rasterisation. Its job is to determine the final colour of that pixel. This is where most of the visual magic happens: texturing, lighting, shadows, reflections, and special effects are all implemented in the fragment shader.
The fragment shader receives interpolated data from the vertex shader (like colour or texture coordinates) and outputs a colour value, typically as an RGBA (red, green, blue, alpha) tuple.
Compute shaders: a separate path
Compute shaders do not participate in the graphics pipeline at all. They are general-purpose programs that run on the GPU, independent of any rendering. You dispatch them with explicit work-group sizes and they can read from and write to buffers and textures.
Compute Pipeline (independent of graphics)
==========================================
CPU (your Rust code)
│
│ Dispatch (work group counts)
▼
┌─────────────────────┐
│ COMPUTE SHADER │ ◄── Programmable (you write this)
│ General-purpose │ Runs once per invocation
│ parallel work │ across work groups
└─────────────────────┘
│
▼
Output buffers / textures
Compute shaders are used for physics simulations, image processing, machine learning inference, procedural generation, and any task that benefits from massive parallelism but does not need the rasterisation pipeline.
Key takeaway: The GPU has two paths for running your code. The graphics pipeline flows from vertex shader through rasterisation to fragment shader, producing pixels on screen. The compute pipeline is a separate, general-purpose path for parallel computation. You will write programs for all three shader types in this course.
3. What is WGSL? Syntax overview
WGSL (WebGPU Shading Language) is the shader language used by the WebGPU API — and by extension, by wgpu. If you have used GLSL or HLSL before, WGSL will feel familiar but with a more explicit, Rust-influenced syntax. If you are new to shader languages, this section covers everything you need to get started.
Scalar types
WGSL provides a small set of scalar types:
| Type | Description |
|---|---|
| f32 | 32-bit floating point |
| f16 | 16-bit floating point (optional feature) |
| i32 | 32-bit signed integer |
| u32 | 32-bit unsigned integer |
| bool | Boolean |
Vector types
Vectors are fundamental in shader programming. WGSL supports vectors of 2, 3, or 4 components:
var a: vec2<f32> = vec2<f32>(1.0, 2.0);
var b: vec3<f32> = vec3<f32>(1.0, 0.0, 0.0); // a red colour or a direction
var c: vec4<f32> = vec4<f32>(0.2, 0.4, 0.8, 1.0); // RGBA colour
// Shorthand constructors (type inference):
var d = vec3f(1.0, 0.0, 0.0); // vec3<f32>
var e = vec4f(0.0, 0.0, 0.0, 1.0);
You can access components with swizzling:
var color = vec4f(1.0, 0.5, 0.2, 1.0);
var rgb = color.rgb; // vec3f(1.0, 0.5, 0.2)
var rr = color.xx; // vec2f(1.0, 1.0)
Components can be accessed as x/y/z/w or r/g/b/a — they are interchangeable aliases.
Matrix types
Matrices are used for transformations (rotation, scaling, projection):
// A 4x4 matrix of f32 values (4 columns, 4 rows)
var transform: mat4x4<f32>;
// A 3x3 matrix
var rotation: mat3x3<f32>;
Matrix-vector multiplication uses the * operator: transform * vec4f(pos, 1.0).
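As a sketch of how matrices and vectors combine, here is a hypothetical WGSL helper that builds a Z-axis rotation matrix — note that the matrix constructor takes column vectors:
fn rotation_z(angle: f32) -> mat3x3<f32> {
    let c = cos(angle);
    let s = sin(angle);
    // Each vec3f below is a *column* of the matrix.
    return mat3x3<f32>(
        vec3f(c, s, 0.0),
        vec3f(-s, c, 0.0),
        vec3f(0.0, 0.0, 1.0),
    );
}
A call like rotation_z(0.5) * vec3f(1.0, 0.0, 0.0) then rotates the vector by half a radian.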
Variables: let vs var
// `let` declares an immutable binding (like Rust's `let`)
let pi = 3.14159;
// `var` declares a mutable variable
var counter: u32 = 0u;
counter = counter + 1u;
Structs
Structs group related data, and they are used extensively for shader inputs and outputs:
struct VertexInput {
@location(0) position: vec3<f32>,
@location(1) color: vec3<f32>,
}
struct VertexOutput {
@builtin(position) clip_position: vec4<f32>,
@location(0) color: vec3<f32>,
}
The @location(n) attribute links struct fields to specific slots in the vertex buffer layout or inter-stage communication. The @builtin(position) attribute tells the GPU this field is the clip-space position.
Functions and entry points
WGSL functions look like this:
fn add(a: f32, b: f32) -> f32 {
return a + b;
}
Entry points are functions marked with a stage attribute:
@vertex
fn vs_main(in: VertexInput) -> VertexOutput {
var out: VertexOutput;
out.clip_position = vec4f(in.position, 1.0);
out.color = in.color;
return out;
}
@fragment
fn fs_main(in: VertexOutput) -> @location(0) vec4<f32> {
return vec4f(in.color, 1.0);
}
@compute @workgroup_size(64)
fn cs_main(@builtin(global_invocation_id) id: vec3<u32>) {
// compute work here
}
The @location(0) on the fragment shader return type means “write to the first colour attachment” (the render target).
Built-in attributes
Some commonly used built-in attributes (a usage sketch follows the table):
| Attribute | Stage | Meaning |
|---|---|---|
| @builtin(position) | Vertex out / Fragment in | Clip-space position / fragment coordinates |
| @builtin(vertex_index) | Vertex | Index of the current vertex |
| @builtin(instance_index) | Vertex | Index of the current instance |
| @builtin(global_invocation_id) | Compute | 3D index of this thread in the dispatch |
| @builtin(local_invocation_id) | Compute | 3D index within the work group |
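For example, @builtin(vertex_index) enables the classic trick of drawing a triangle with no vertex buffer at all: the shader indexes a hard-coded array. This sketch is illustrative and not used in the exercises:
@vertex
fn vs_bufferless(@builtin(vertex_index) i: u32) -> @builtin(position) vec4<f32> {
    // Positions live in the shader itself; `var` allows dynamic indexing.
    var positions = array<vec2<f32>, 3>(
        vec2f(0.0, 0.5),
        vec2f(-0.5, -0.5),
        vec2f(0.5, -0.5),
    );
    return vec4f(positions[i], 0.0, 1.0);
}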
Binding resources
Uniforms, storage buffers, textures, and samplers are declared at module scope with @group and @binding attributes:
@group(0) @binding(0)
var<uniform> time: f32;
@group(0) @binding(1)
var texture: texture_2d<f32>;
@group(0) @binding(2)
var tex_sampler: sampler;
The @group(n) corresponds to a bind group index, and @binding(n) is the binding within that group. These must match the bind group layout you define on the Rust side.
Control flow
WGSL supports if/else, for, while, loop, switch, break, continue, and return:
for (var i: u32 = 0u; i < 10u; i = i + 1u) {
if i == 5u {
continue;
}
// do work
}
Key takeaway: WGSL’s syntax is a blend of Rust and C-family languages. Types are explicit, entry points are marked with stage attributes (@vertex, @fragment, @compute), and data flows between stages via structs annotated with @location and @builtin. You will write WGSL for every exercise in this course.
Part 2 — Setting Up with wgpu
4. What is wgpu? Cross-platform graphics API in Rust
wgpu is a Rust crate that implements the WebGPU API specification. It provides a safe, cross-platform interface for GPU programming that works on multiple backends:
| Backend | Platform |
|---|---|
| Vulkan | Linux, Windows, Android |
| Metal | macOS, iOS |
| DX12 | Windows |
| WebGPU | Web browsers (via wasm) |
| OpenGL | Fallback for older systems |
This means you write your GPU code once and it runs everywhere — on desktop, on mobile, and in the browser.
Why not raw Vulkan/Metal/DX12?
Writing directly against a low-level graphics API like Vulkan requires thousands of lines of boilerplate before you can draw a single triangle. Vulkan’s explicit nature gives you maximum control, but the complexity is enormous. wgpu provides a higher-level abstraction that handles the platform differences and much of the boilerplate while still being close enough to the metal for serious work.
Key types in wgpu
Here are the core types you will interact with, in the order you typically create them:
Initialization Flow
====================
Instance
│
│ enumerate adapters
▼
Adapter ←── represents a physical GPU
│
│ request device
▼
Device + Queue
│ │
│ │ submit commands
│ ▼
│ (GPU execution)
│
│ create resources
▼
Buffers, Textures, Pipelines, Bind Groups, ...
- Instance: the entry point to wgpu. Created first, used to find adapters and create surfaces.
- Surface: a handle to a window’s drawable area. Created from a window (provided by a windowing library like winit).
- Adapter: represents a physical GPU. You request one from the instance, optionally specifying preferences (power preference, compatibility with your surface).
- Device: a logical connection to the GPU. You create resources (buffers, textures, pipelines) through the device. Think of it as an open connection to the GPU.
- Queue: used to submit work (command buffers) to the GPU. You get a queue together with the device.
- CommandEncoder: records GPU commands (render passes, compute dispatches, buffer copies) into a command buffer. The command buffer is then submitted to the queue.
- RenderPipeline: describes the full configuration for rendering — which shaders to use, vertex layout, blending mode, pixel format, etc.
- Buffer: a block of GPU-accessible memory. Used for vertex data, index data, uniforms, storage, etc.
- BindGroup: a collection of resources (buffers, textures, samplers) that are made available to shaders. Corresponds to @group(n) in WGSL.
The initialisation sequence in code
Here is a simplified view of wgpu initialisation (we will see the full code in Exercise 1):
#![allow(unused)]
fn main() {
// 1. Create an instance
let instance = wgpu::Instance::new(&wgpu::InstanceDescriptor::default());
// 2. Create a surface from a window
let surface = instance.create_surface(&window)?;
// 3. Request an adapter (physical GPU)
let adapter = instance
.request_adapter(&wgpu::RequestAdapterOptions {
power_preference: wgpu::PowerPreference::default(),
compatible_surface: Some(&surface),
force_fallback_adapter: false,
})
.await
.unwrap();
// 4. Request a device and queue
let (device, queue) = adapter
.request_device(&wgpu::DeviceDescriptor::default(), None)
.await
.unwrap();
// 5. Configure the surface
let config = surface.get_default_config(&adapter, width, height).unwrap();
surface.configure(&device, &config);
}
After this, you are ready to create pipelines, buffers, and start rendering.
Key takeaway: wgpu is a cross-platform GPU abstraction for Rust. You create an Instance, get an Adapter (a physical GPU), open a Device + Queue, and then create resources and submit commands. This same code works on Vulkan, Metal, DX12, and WebGPU.
5. Exercise 1: create a window and clear it to a colour
In this exercise you will create a window using winit, initialise wgpu, and fill the window with a solid colour (cornflower blue). This is the “hello world” of GPU programming.
Step 1: project setup
Create a new Rust project and add the required dependencies to Cargo.toml:
[package]
name = "shader-exercises"
version = "0.1.0"
edition = "2021"
[dependencies]
wgpu = "24"
winit = "30"
pollster = "0.4"
log = "0.4"
env_logger = "0.11"
[profile.release]
opt-level = "z"
lto = true
strip = true
codegen-units = 1
- wgpu: the GPU abstraction layer
- winit: cross-platform window creation and event handling
- pollster: a minimal async executor to block on futures (wgpu uses async for initialisation)
- log: the logging facade that wgpu emits messages through
- env_logger: a log backend, so wgpu can report errors and warnings
Step 2: the complete code
use winit::{
application::ApplicationHandler,
event::WindowEvent,
event_loop::EventLoop,
window::{Window, WindowAttributes},
};
use std::sync::Arc;
/// Holds all wgpu state needed for rendering.
struct GpuState {
surface: wgpu::Surface<'static>,
device: wgpu::Device,
queue: wgpu::Queue,
config: wgpu::SurfaceConfiguration,
}
/// The main application struct.
struct App {
window: Option<Arc<Window>>,
gpu: Option<GpuState>,
}
impl App {
fn new() -> Self {
Self {
window: None,
gpu: None,
}
}
/// Initialise wgpu with the given window.
fn init_gpu(&mut self, window: Arc<Window>) {
let size = window.inner_size();
let instance = wgpu::Instance::new(&wgpu::InstanceDescriptor::default());
let surface = instance.create_surface(window.clone()).unwrap();
let adapter = pollster::block_on(instance.request_adapter(
&wgpu::RequestAdapterOptions {
power_preference: wgpu::PowerPreference::default(),
compatible_surface: Some(&surface),
force_fallback_adapter: false,
},
))
.expect("Failed to find a suitable GPU adapter");
let (device, queue) = pollster::block_on(adapter.request_device(
&wgpu::DeviceDescriptor::default(),
None,
))
.expect("Failed to create device");
let config = surface
.get_default_config(&adapter, size.width.max(1), size.height.max(1))
.expect("Surface is not supported by the adapter");
surface.configure(&device, &config);
self.gpu = Some(GpuState {
surface,
device,
queue,
config,
});
}
/// Render a single frame: clear the screen to cornflower blue.
fn render(&self) {
let gpu = self.gpu.as_ref().unwrap();
// Get the next frame's texture to draw on
let output = gpu.surface.get_current_texture()
.expect("Failed to get surface texture");
let view = output.texture.create_view(&Default::default());
// Create a command encoder to record GPU commands
let mut encoder = gpu.device.create_command_encoder(
&wgpu::CommandEncoderDescriptor {
label: Some("Clear Encoder"),
},
);
// Begin a render pass that clears to cornflower blue
{
let _render_pass = encoder.begin_render_pass(
&wgpu::RenderPassDescriptor {
label: Some("Clear Pass"),
color_attachments: &[Some(
wgpu::RenderPassColorAttachment {
view: &view,
resolve_target: None,
ops: wgpu::Operations {
load: wgpu::LoadOp::Clear(
wgpu::Color {
r: 0.392,
g: 0.584,
b: 0.929,
a: 1.0,
},
),
store: wgpu::StoreOp::Store,
},
},
)],
depth_stencil_attachment: None,
..Default::default()
},
);
// The render pass is dropped here, ending it
}
// Submit the commands to the GPU
gpu.queue.submit(std::iter::once(encoder.finish()));
// Present the frame on screen
output.present();
}
}
impl ApplicationHandler for App {
fn resumed(&mut self, event_loop: &winit::event_loop::ActiveEventLoop) {
if self.window.is_none() {
let attrs = WindowAttributes::default()
.with_title("Exercise 1: Cornflower Blue");
let window = Arc::new(
event_loop.create_window(attrs).unwrap()
);
self.init_gpu(window.clone());
self.window = Some(window);
}
}
fn window_event(
&mut self,
event_loop: &winit::event_loop::ActiveEventLoop,
_window_id: winit::window::WindowId,
event: WindowEvent,
) {
match event {
WindowEvent::CloseRequested => {
event_loop.exit();
}
WindowEvent::Resized(new_size) => {
if let Some(gpu) = &mut self.gpu {
gpu.config.width = new_size.width.max(1);
gpu.config.height = new_size.height.max(1);
gpu.surface.configure(&gpu.device, &gpu.config);
}
}
WindowEvent::RedrawRequested => {
self.render();
if let Some(window) = &self.window {
window.request_redraw();
}
}
_ => {}
}
}
}
fn main() {
env_logger::init();
let event_loop = EventLoop::new().unwrap();
let mut app = App::new();
event_loop.run_app(&mut app).unwrap();
}
Step 3: run it
cargo run
You should see a window filled with cornflower blue (a pleasant mid-blue, rgb(100, 149, 237)). The window responds to resizing and closes when you click the close button.
What just happened?
Let’s break down the key parts:
- Window creation: winit creates a native window. We wrap it in Arc so wgpu can reference it.
- Surface: created from the window — this is where rendered frames go.
- Adapter + Device + Queue: we find a GPU, open a logical device, and get a command queue.
- Surface configuration: tells the surface what pixel format and size to use.
- Render loop: every frame we create a CommandEncoder, begin a RenderPass with a clear colour, end the pass, submit commands, and present.
The clear colour is specified as wgpu::Color { r, g, b, a } with values in the 0.0-1.0 range.
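If you prefer thinking in familiar 0-255 values, a small helper (not part of the exercise code) can do the conversion — wgpu::Color stores each channel as an f64:
#![allow(unused)]
fn main() {
fn color_from_u8(r: u8, g: u8, b: u8) -> wgpu::Color {
    // Map 0-255 integer channels onto the 0.0-1.0 range wgpu expects.
    wgpu::Color {
        r: r as f64 / 255.0,
        g: g as f64 / 255.0,
        b: b as f64 / 255.0,
        a: 1.0,
    }
}
}
color_from_u8(100, 149, 237) reproduces the cornflower blue above (100/255 ≈ 0.392, and so on).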
Try this: change the colour to something else — pure red (1.0, 0.0, 0.0, 1.0), bright green, or your favourite colour. Rebuild and see the change.
6. The render loop: swap chains, frames, command encoders
Now that you have a working window, let’s dive deeper into what happens each frame. Understanding the render loop is crucial because every shader program you write will run inside this cycle.
The frame lifecycle
Every frame follows the same sequence. Here is what happens between one screen update and the next:
Frame Lifecycle
===============
Time ──────────────────────────────────────────────────►
┌──── Frame N ─────────────────────┐ ┌── Frame N+1 ──
│ │ │
│ 1. Acquire 2. Record 3. Submit 4. Present
│ surface commands to to
│ texture (render queue screen
│ pass)
│ │ │
│ CPU CPU │ GPU executes
│ side side │ asynchronously
└──────────────────────────────────┘ └────────────────
Step 1: acquire a surface texture
#![allow(unused)]
fn main() {
let output = surface.get_current_texture()?;
let view = output.texture.create_view(&Default::default());
}
The surface manages a small pool of textures (typically 2-3, called a swap chain). When you call get_current_texture(), you receive the next available texture to draw on. While you are drawing on texture A, the GPU may still be displaying the previous texture B on screen — this is double buffering.
Double Buffering
================
┌────────────┐ ┌────────────┐
│ Texture A │ │ Texture B │
│ (drawing) │ │ (on screen)│
└────────────┘ └────────────┘
▲ ▲
│ │
You render Monitor
into this displays
one now this one
After you present texture A, the roles swap: A goes to the screen and B becomes available for the next frame.
Step 2: record commands with a command encoder
#![allow(unused)]
fn main() {
let mut encoder = device.create_command_encoder(&Default::default());
}
The CommandEncoder is like a tape recorder for GPU commands. You do not execute anything immediately — you record a list of operations, and then submit them all at once. This is called a command buffer model.
Why not execute commands immediately? Because the GPU operates asynchronously. Batching commands into a buffer lets the GPU execute them efficiently without constant back-and-forth with the CPU.
Step 3: begin a render pass
#![allow(unused)]
fn main() {
let render_pass = encoder.begin_render_pass(&wgpu::RenderPassDescriptor {
    color_attachments: &[Some(wgpu::RenderPassColorAttachment {
        view: &view,
        resolve_target: None, // no multisampling, so no resolve texture
        ops: wgpu::Operations {
            load: wgpu::LoadOp::Clear(clear_color),
            store: wgpu::StoreOp::Store,
        },
    })],
    ..Default::default()
});
}
A render pass is a sequence of draw commands that all target the same set of attachments (colour textures, depth buffers). Within a render pass, you:
- Set the pipeline
- Bind vertex buffers and bind groups
- Issue draw calls
The load operation specifies what happens to the attachment at the start of the pass. LoadOp::Clear(color) fills it with a solid colour. LoadOp::Load preserves the previous contents.
The store operation specifies what happens at the end. StoreOp::Store keeps the results; StoreOp::Discard throws them away (useful for depth buffers you do not need after the pass).
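To make the two knobs concrete, here is a sketch of the Operations values you would place in a colour attachment's ops field (wgpu::Color::BLACK is just a convenient constant):
#![allow(unused)]
fn main() {
// Clear to a solid colour at the start, keep the results at the end —
// the common case for the first pass of a frame:
let clear_then_store = wgpu::Operations {
    load: wgpu::LoadOp::Clear(wgpu::Color::BLACK),
    store: wgpu::StoreOp::Store,
};
// Preserve the existing contents instead — useful when a later pass
// draws on top of an earlier pass's output:
let load_then_store: wgpu::Operations<wgpu::Color> = wgpu::Operations {
    load: wgpu::LoadOp::Load,
    store: wgpu::StoreOp::Store,
};
}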
Step 4: submit and present
#![allow(unused)]
fn main() {
// End the render pass (drop it)
drop(render_pass);
// Finish recording and get a command buffer
let command_buffer = encoder.finish();
// Submit the command buffer to the GPU
queue.submit(std::iter::once(command_buffer));
// Show the rendered frame on screen
output.present();
}
queue.submit() sends the command buffer to the GPU for execution. The GPU processes it asynchronously — your CPU code continues immediately. output.present() tells the surface to display this texture once the GPU finishes rendering to it.
Multiple render passes
You can have multiple render passes in a single frame. This is common for:
- Shadow mapping: render the scene from a light’s perspective (pass 1), then render the final image using the shadow map (pass 2)
- Post-processing: render the scene to a texture (pass 1), then apply a blur filter to that texture and draw the result to the screen (pass 2)
Each pass gets its own begin_render_pass / drop cycle within the same command encoder.
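Here is a sketch of two passes recorded into a single encoder, assuming scene_view (a view of an offscreen texture) and surface_view (the window's surface texture view) already exist; the pipelines and draw calls are elided:
#![allow(unused)]
fn main() {
let mut encoder = device.create_command_encoder(&Default::default());
// Pass 1: render the scene into the offscreen texture.
{
    let _pass = encoder.begin_render_pass(&wgpu::RenderPassDescriptor {
        label: Some("Scene Pass"),
        color_attachments: &[Some(wgpu::RenderPassColorAttachment {
            view: &scene_view,
            resolve_target: None,
            ops: wgpu::Operations {
                load: wgpu::LoadOp::Clear(wgpu::Color::BLACK),
                store: wgpu::StoreOp::Store,
            },
        })],
        depth_stencil_attachment: None,
        ..Default::default()
    });
    // ... set the scene pipeline and issue draw calls ...
} // dropping the pass ends it
// Pass 2: sample the offscreen texture and draw the final image.
{
    let _pass = encoder.begin_render_pass(&wgpu::RenderPassDescriptor {
        label: Some("Post Pass"),
        color_attachments: &[Some(wgpu::RenderPassColorAttachment {
            view: &surface_view,
            resolve_target: None,
            ops: wgpu::Operations {
                load: wgpu::LoadOp::Clear(wgpu::Color::BLACK),
                store: wgpu::StoreOp::Store,
            },
        })],
        depth_stencil_attachment: None,
        ..Default::default()
    });
    // ... set the post-processing pipeline and draw a fullscreen quad ...
}
queue.submit(std::iter::once(encoder.finish()));
}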
Key takeaway: each frame, you acquire a surface texture, record GPU commands into a command encoder (including one or more render passes), submit the commands to the queue, and present the result. The CPU and GPU work asynchronously — the CPU records commands for the next frame while the GPU executes the current one.
Part 3 — Vertex and Fragment Shaders
7. Vertices, buffers, and the vertex shader
To draw anything beyond a solid colour, you need to send geometry to the GPU. Geometry is made of vertices — points in space that define the corners of triangles. This section explains how vertex data flows from your Rust code to the vertex shader on the GPU.
What is a vertex?
A vertex is a point with associated data. At minimum, a vertex has a position, but it usually carries additional attributes:
Vertex Data (per vertex)
========================
┌──────────────────────────────────────────────┐
│ position: vec3<f32> (x, y, z) │
│ color: vec3<f32> (r, g, b) │
│ uv: vec2<f32> (texture coordinate) │
│ normal: vec3<f32> (surface direction) │
└──────────────────────────────────────────────┘
For a simple coloured triangle, you might have three vertices with position and colour:
#![allow(unused)]
fn main() {
#[repr(C)]
#[derive(Copy, Clone, bytemuck::Pod, bytemuck::Zeroable)]
struct Vertex {
position: [f32; 3],
color: [f32; 3],
}
const VERTICES: &[Vertex] = &[
Vertex { position: [ 0.0, 0.5, 0.0], color: [1.0, 0.0, 0.0] }, // top, red
Vertex { position: [-0.5, -0.5, 0.0], color: [0.0, 1.0, 0.0] }, // left, green
Vertex { position: [ 0.5, -0.5, 0.0], color: [0.0, 0.0, 1.0] }, // right, blue
];
}
The #[repr(C)] attribute ensures the struct has a predictable memory layout matching what the GPU expects. The bytemuck derives let us safely cast the struct to raw bytes.
Vertex buffers
To get vertex data onto the GPU, you create a vertex buffer:
#![allow(unused)]
fn main() {
use wgpu::util::DeviceExt;
let vertex_buffer = device.create_buffer_init(&wgpu::util::BufferInitDescriptor {
label: Some("Vertex Buffer"),
contents: bytemuck::cast_slice(VERTICES),
usage: wgpu::BufferUsages::VERTEX,
});
}
This copies the vertex data from CPU memory into GPU memory. The VERTEX usage flag tells wgpu that this buffer will be used as a vertex buffer.
Vertex buffer layout
The GPU does not know the structure of your vertex data — you must describe it with a vertex buffer layout:
#![allow(unused)]
fn main() {
let vertex_layout = wgpu::VertexBufferLayout {
array_stride: std::mem::size_of::<Vertex>() as u64,
step_mode: wgpu::VertexStepMode::Vertex,
attributes: &[
// position: 3 floats at offset 0
wgpu::VertexAttribute {
format: wgpu::VertexFormat::Float32x3,
offset: 0,
shader_location: 0,
},
// color: 3 floats at offset 12 bytes (after 3 x f32)
wgpu::VertexAttribute {
format: wgpu::VertexFormat::Float32x3,
offset: 12,
shader_location: 1,
},
],
};
}
This tells the GPU: “each vertex is N bytes apart (array_stride), and within each vertex, location 0 is three floats starting at byte 0, and location 1 is three floats starting at byte 12.”
The shader_location values correspond to @location(n) in your WGSL shader.
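As an aside, wgpu also provides the vertex_attr_array! macro, which computes the offsets for you; the following is equivalent to the hand-written attributes above:
#![allow(unused)]
fn main() {
// location 0 => Float32x3 (position), location 1 => Float32x3 (color);
// the macro fills in the byte offsets automatically.
const ATTRIBUTES: [wgpu::VertexAttribute; 2] =
    wgpu::vertex_attr_array![0 => Float32x3, 1 => Float32x3];
let vertex_layout = wgpu::VertexBufferLayout {
    array_stride: std::mem::size_of::<Vertex>() as u64,
    step_mode: wgpu::VertexStepMode::Vertex,
    attributes: &ATTRIBUTES,
};
}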
How data flows from CPU to vertex shader
CPU Memory GPU Memory Vertex Shader
========== ========== =============
Vertex array copy Vertex buffer read @location(0) position
[pos, color] ──────────► [bytes...] ──────────► @location(1) color
[pos, color] [bytes...]
[pos, color] [bytes...]
The layout descriptor tells the GPU how to
interpret the bytes into typed attributes.
The vertex shader’s job
The vertex shader runs once per vertex. It must output a @builtin(position) value in clip space — a coordinate system where:
- x ranges from -1 (left) to +1 (right)
- y ranges from -1 (bottom) to +1 (top)
- z ranges from 0 (near) to 1 (far)
Anything outside these ranges is clipped (not drawn).
struct VertexInput {
@location(0) position: vec3<f32>,
@location(1) color: vec3<f32>,
}
struct VertexOutput {
@builtin(position) clip_position: vec4<f32>,
@location(0) color: vec3<f32>,
}
@vertex
fn vs_main(in: VertexInput) -> VertexOutput {
var out: VertexOutput;
out.clip_position = vec4f(in.position, 1.0);
out.color = in.color;
return out;
}
In this simple shader, the position passes through unchanged (we are already working in clip space). In real applications, you would multiply by a model-view-projection matrix to transform from 3D world coordinates to clip space.
Key takeaway: vertices carry per-point data (position, colour, etc.) packed into a buffer. The vertex buffer layout tells the GPU how to decode the bytes. The vertex shader transforms each vertex’s position into clip space and passes any additional data (like colour) to the next stage.
8. Interpolation and the fragment shader
After the vertex shader has transformed all vertices and the GPU has assembled them into triangles, rasterisation takes over. This section explains what happens between the vertex shader and the fragment shader — the critical concept of interpolation.
From triangles to pixels
Rasterisation determines which pixels on screen fall inside each triangle. For each pixel inside a triangle, the rasteriser generates a fragment. But what data does each fragment carry?
Consider a triangle with a red vertex, a green vertex, and a blue vertex:
Red (1,0,0)
/\
/ \
/ \
/ what \
/ colour \
/ is this \
/ pixel? \
/______________\
Green Blue
(0,1,0) (0,0,1)
A pixel near the red vertex should be mostly red. A pixel exactly in the centre should be an equal mix of red, green, and blue. The GPU computes this automatically using barycentric interpolation.
Barycentric coordinates
Every point inside a triangle can be expressed as a weighted combination of the three vertices. These weights are called barycentric coordinates (w0, w1, w2), where:
- w0 + w1 + w2 = 1.0
- All weights are between 0 and 1
Barycentric Interpolation
=========================
Point P inside triangle ABC:
P = w0 * A + w1 * B + w2 * C
At vertex A: w0=1, w1=0, w2=0 → colour = A.color
At vertex B: w0=0, w1=1, w2=0 → colour = B.color
At vertex C: w0=0, w1=0, w2=1 → colour = C.color
At centre: w0=⅓, w1=⅓, w2=⅓ → colour = average
The GPU performs this interpolation automatically for every field in the VertexOutput struct (except @builtin(position), which is used for rasterisation itself). This means colours, texture coordinates, normals — everything — gets smoothly interpolated across the triangle surface.
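To make the weighting concrete, here is a small CPU-side sketch of the same computation (illustrative only — the GPU does this in fixed-function hardware):
#![allow(unused)]
fn main() {
/// Blend three vertex colours using barycentric weights (which sum to 1.0).
fn interpolate_color(w: [f32; 3], colors: [[f32; 3]; 3]) -> [f32; 3] {
    let mut out = [0.0_f32; 3];
    for c in 0..3 {
        out[c] = w[0] * colors[0][c] + w[1] * colors[1][c] + w[2] * colors[2][c];
    }
    out
}
// At the centre of a red/green/blue triangle, w = [1/3, 1/3, 1/3] gives
// roughly [0.33, 0.33, 0.33] — the even grey-ish mix described above.
}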
The fragment shader
The fragment shader runs once for each fragment generated by rasterisation. It receives the interpolated values from the vertex shader and outputs a colour:
@fragment
fn fs_main(in: VertexOutput) -> @location(0) vec4<f32> {
return vec4f(in.color, 1.0);
}
In this simple example, the fragment shader just passes through the interpolated colour. But you can do much more:
- Sample a texture using interpolated UV coordinates
- Apply lighting calculations using interpolated normals
- Compute procedural patterns based on the fragment position
- Discard fragments to create transparency cutouts
What the fragment receives
The fragment shader’s input looks like the vertex shader’s output, but the values have been interpolated:
Vertex Shader Output Fragment Shader Input
==================== ====================
Vertex 0: color=(1,0,0)
Vertex 1: color=(0,1,0) ──► Fragment at centre:
Vertex 2: color=(0,0,1) color=(0.33, 0.33, 0.33)
The @builtin(position) field in the fragment shader’s input contains the fragment’s window-space coordinates — (x, y) in pixel coordinates. This can be useful for screen-space effects.
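For instance, a fragment shader could use those pixel coordinates to draw screen-space stripes — a hypothetical sketch, reusing the VertexOutput struct from earlier:
@fragment
fn fs_stripes(in: VertexOutput) -> @location(0) vec4<f32> {
    // clip_position.xy is in pixels here; alternate every 10 pixels.
    let stripe = f32(u32(in.clip_position.x / 10.0) % 2u);
    return vec4f(vec3f(stripe), 1.0);
}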
Visual result
When you draw our red-green-blue triangle, the interpolation produces a smooth colour gradient:
Expected visual output:
┌───────────────────────┐
│ │
│ ▲ Red │
│ ╱ ╲ │
│ ╱ ╲ │
│ ╱ gra ╲ │
│ ╱ dient ╲ │
│ ╱─────────╲ │
│ Green Blue │
│ │
└───────────────────────┘
Colours blend smoothly across the
triangle surface via interpolation.
Key takeaway: the GPU automatically interpolates all vertex shader outputs across the triangle surface using barycentric coordinates. The fragment shader receives these smoothly interpolated values and uses them to compute the final pixel colour. This is why a triangle with three different vertex colours produces a smooth gradient.
9. Exercise 2: draw a coloured triangle
Time to put theory into practice. In this exercise you will extend the Exercise 1 code to draw a coloured triangle with red, green, and blue vertices.
What you will add
- A WGSL shader with vertex and fragment entry points
- A vertex buffer with three coloured vertices
- A render pipeline that connects everything
- A draw call inside the render pass
Step 1: add bytemuck to Cargo.toml
[dependencies]
wgpu = { version = "24", features = ["wgsl"] }
winit = "30"
pollster = "0.4"
log = "0.4"
env_logger = "0.11"
bytemuck = { version = "1", features = ["derive"] }
Step 2: the WGSL shader
Create a file called shader.wgsl in the same directory as main.rs — we will pull its contents into the binary at compile time with include_str!:
// Vertex input: position and colour from the vertex buffer
struct VertexInput {
@location(0) position: vec3<f32>,
@location(1) color: vec3<f32>,
}
// Output from vertex shader, input to fragment shader
struct VertexOutput {
@builtin(position) clip_position: vec4<f32>,
@location(0) color: vec3<f32>,
}
@vertex
fn vs_main(in: VertexInput) -> VertexOutput {
var out: VertexOutput;
out.clip_position = vec4f(in.position, 1.0);
out.color = in.color;
return out;
}
@fragment
fn fs_main(in: VertexOutput) -> @location(0) vec4<f32> {
return vec4f(in.color, 1.0);
}
Step 3: the Rust code
Below is the complete program. It extends Exercise 1 with a vertex buffer, pipeline, and draw call:
use winit::{
application::ApplicationHandler,
event::WindowEvent,
event_loop::EventLoop,
window::{Window, WindowAttributes},
};
use wgpu::util::DeviceExt;
use std::sync::Arc;
// Vertex data structure — must match the shader's VertexInput
#[repr(C)]
#[derive(Copy, Clone, bytemuck::Pod, bytemuck::Zeroable)]
struct Vertex {
position: [f32; 3],
color: [f32; 3],
}
impl Vertex {
/// Describe the memory layout for the GPU.
fn layout() -> wgpu::VertexBufferLayout<'static> {
wgpu::VertexBufferLayout {
array_stride: std::mem::size_of::<Vertex>() as u64,
step_mode: wgpu::VertexStepMode::Vertex,
attributes: &[
wgpu::VertexAttribute {
format: wgpu::VertexFormat::Float32x3,
offset: 0,
shader_location: 0, // @location(0) position
},
wgpu::VertexAttribute {
format: wgpu::VertexFormat::Float32x3,
offset: 12,
shader_location: 1, // @location(1) color
},
],
}
}
}
// Three vertices forming a coloured triangle
const VERTICES: &[Vertex] = &[
Vertex { position: [ 0.0, 0.5, 0.0], color: [1.0, 0.0, 0.0] }, // top — red
Vertex { position: [-0.5, -0.5, 0.0], color: [0.0, 1.0, 0.0] }, // bottom-left — green
Vertex { position: [ 0.5, -0.5, 0.0], color: [0.0, 0.0, 1.0] }, // bottom-right — blue
];
struct GpuState {
surface: wgpu::Surface<'static>,
device: wgpu::Device,
queue: wgpu::Queue,
config: wgpu::SurfaceConfiguration,
pipeline: wgpu::RenderPipeline,
vertex_buffer: wgpu::Buffer,
}
struct App {
window: Option<Arc<Window>>,
gpu: Option<GpuState>,
}
impl App {
fn new() -> Self {
Self { window: None, gpu: None }
}
fn init_gpu(&mut self, window: Arc<Window>) {
let size = window.inner_size();
let instance = wgpu::Instance::new(&Default::default());
let surface = instance.create_surface(window.clone()).unwrap();
let adapter = pollster::block_on(instance.request_adapter(
&wgpu::RequestAdapterOptions {
compatible_surface: Some(&surface),
..Default::default()
},
)).unwrap();
let (device, queue) = pollster::block_on(
adapter.request_device(&Default::default(), None)
).unwrap();
let config = surface
.get_default_config(&adapter, size.width.max(1), size.height.max(1))
.unwrap();
surface.configure(&device, &config);
// Create the shader module from WGSL source
let shader = device.create_shader_module(wgpu::ShaderModuleDescriptor {
label: Some("Triangle Shader"),
source: wgpu::ShaderSource::Wgsl(include_str!("shader.wgsl").into()),
});
// Create the render pipeline
let pipeline_layout = device.create_pipeline_layout(
&wgpu::PipelineLayoutDescriptor {
label: Some("Pipeline Layout"),
bind_group_layouts: &[],
push_constant_ranges: &[],
},
);
let pipeline = device.create_render_pipeline(
&wgpu::RenderPipelineDescriptor {
label: Some("Triangle Pipeline"),
layout: Some(&pipeline_layout),
vertex: wgpu::VertexState {
module: &shader,
entry_point: Some("vs_main"),
buffers: &[Vertex::layout()],
compilation_options: Default::default(),
},
fragment: Some(wgpu::FragmentState {
module: &shader,
entry_point: Some("fs_main"),
targets: &[Some(wgpu::ColorTargetState {
format: config.format,
blend: Some(wgpu::BlendState::REPLACE),
write_mask: wgpu::ColorWrites::ALL,
})],
compilation_options: Default::default(),
}),
primitive: wgpu::PrimitiveState {
topology: wgpu::PrimitiveTopology::TriangleList,
strip_index_format: None,
front_face: wgpu::FrontFace::Ccw,
cull_mode: Some(wgpu::Face::Back),
unclipped_depth: false,
polygon_mode: wgpu::PolygonMode::Fill,
conservative: false,
},
depth_stencil: None,
multisample: wgpu::MultisampleState::default(),
multiview: None,
cache: None,
},
);
// Create the vertex buffer
let vertex_buffer = device.create_buffer_init(
&wgpu::util::BufferInitDescriptor {
label: Some("Vertex Buffer"),
contents: bytemuck::cast_slice(VERTICES),
usage: wgpu::BufferUsages::VERTEX,
},
);
self.gpu = Some(GpuState {
surface, device, queue, config, pipeline, vertex_buffer,
});
}
fn render(&self) {
let gpu = self.gpu.as_ref().unwrap();
let output = gpu.surface.get_current_texture().unwrap();
let view = output.texture.create_view(&Default::default());
let mut encoder = gpu.device.create_command_encoder(&Default::default());
{
let mut pass = encoder.begin_render_pass(&wgpu::RenderPassDescriptor {
label: Some("Triangle Pass"),
color_attachments: &[Some(wgpu::RenderPassColorAttachment {
view: &view,
resolve_target: None,
ops: wgpu::Operations {
load: wgpu::LoadOp::Clear(wgpu::Color {
r: 0.1, g: 0.1, b: 0.1, a: 1.0,
}),
store: wgpu::StoreOp::Store,
},
})],
depth_stencil_attachment: None,
..Default::default()
});
pass.set_pipeline(&gpu.pipeline);
pass.set_vertex_buffer(0, gpu.vertex_buffer.slice(..));
pass.draw(0..3, 0..1); // 3 vertices, 1 instance
}
gpu.queue.submit(std::iter::once(encoder.finish()));
output.present();
}
}
impl ApplicationHandler for App {
fn resumed(&mut self, event_loop: &winit::event_loop::ActiveEventLoop) {
if self.window.is_none() {
let window = Arc::new(
event_loop.create_window(
WindowAttributes::default().with_title("Exercise 2: Coloured Triangle")
).unwrap()
);
self.init_gpu(window.clone());
self.window = Some(window);
}
}
fn window_event(
&mut self,
event_loop: &winit::event_loop::ActiveEventLoop,
_id: winit::window::WindowId,
event: WindowEvent,
) {
match event {
WindowEvent::CloseRequested => event_loop.exit(),
WindowEvent::Resized(size) => {
if let Some(gpu) = &mut self.gpu {
gpu.config.width = size.width.max(1);
gpu.config.height = size.height.max(1);
gpu.surface.configure(&gpu.device, &gpu.config);
}
}
WindowEvent::RedrawRequested => {
self.render();
if let Some(w) = &self.window { w.request_redraw(); }
}
_ => {}
}
}
}
fn main() {
env_logger::init();
let event_loop = EventLoop::new().unwrap();
event_loop.run_app(&mut App::new()).unwrap();
}
Step 4: run and observe
cargo run
You should see a triangle centred in the window with a smooth gradient: red at the top, green at the bottom-left, and blue at the bottom-right. The colours blend smoothly across the surface thanks to the interpolation discussed in section 8.
Key concepts demonstrated
- Vertex struct with #[repr(C)] and bytemuck for safe casting to bytes
- Vertex buffer layout mapping struct fields to @location(n) in the shader
- Shader module loaded from WGSL source via include_str!
- Render pipeline connecting shaders, vertex layout, and output format
- Draw call (pass.draw(0..3, 0..1)) telling the GPU to process 3 vertices as one triangle
Challenge: add three more vertices and draw a second triangle to form a rectangle. You will need 6 vertices total (two triangles of 3 vertices each) and change the draw call to pass.draw(0..6, 0..1).
10. Exercise 3: animate the triangle using a time uniform
Static shapes are nice, but animation is where shaders really shine. In this exercise you will pass an elapsed time value to the vertex shader and use it to rotate the triangle.
New concepts
- Uniform buffers: small, read-only buffers for data that is the same across all vertices/fragments in a draw call (like time, camera matrices, light positions)
- Bind groups: how you connect uniform buffers (and other resources) to shader bindings
- Updating buffers: writing new data to a buffer each frame
Step 1: the updated WGSL shader
struct VertexInput {
@location(0) position: vec3<f32>,
@location(1) color: vec3<f32>,
}
struct VertexOutput {
@builtin(position) clip_position: vec4<f32>,
@location(0) color: vec3<f32>,
}
// A uniform buffer containing the elapsed time
@group(0) @binding(0)
var<uniform> time: f32;
@vertex
fn vs_main(in: VertexInput) -> VertexOutput {
// Rotate the vertex around the Z axis
let angle = time;
let cos_a = cos(angle);
let sin_a = sin(angle);
let rotated = vec3f(
in.position.x * cos_a - in.position.y * sin_a,
in.position.x * sin_a + in.position.y * cos_a,
in.position.z,
);
var out: VertexOutput;
out.clip_position = vec4f(rotated, 1.0);
out.color = in.color;
return out;
}
@fragment
fn fs_main(in: VertexOutput) -> @location(0) vec4<f32> {
return vec4f(in.color, 1.0);
}
The key change is the time uniform and the 2D rotation applied to each vertex. The rotation formulas are:
x' = x * cos(angle) - y * sin(angle)
y' = x * sin(angle) + y * cos(angle)
This rotates the triangle around the origin (centre of clip space) at one radian per second.
Step 2: create the uniform buffer and bind group
On the Rust side, you need to:
- Create a buffer for the time value
- Create a bind group layout describing the binding
- Create a bind group linking the buffer to the layout
- Update the pipeline layout to include the bind group layout
#![allow(unused)]
fn main() {
use std::time::Instant;
// Create the uniform buffer (4 bytes for one f32)
let time_buffer = device.create_buffer(&wgpu::BufferDescriptor {
label: Some("Time Uniform Buffer"),
size: std::mem::size_of::<f32>() as u64,
usage: wgpu::BufferUsages::UNIFORM | wgpu::BufferUsages::COPY_DST,
mapped_at_creation: false,
});
// Create a bind group layout
let bind_group_layout = device.create_bind_group_layout(
&wgpu::BindGroupLayoutDescriptor {
label: Some("Time Bind Group Layout"),
entries: &[wgpu::BindGroupLayoutEntry {
binding: 0,
visibility: wgpu::ShaderStages::VERTEX,
ty: wgpu::BindingType::Buffer {
ty: wgpu::BufferBindingType::Uniform,
has_dynamic_offset: false,
min_binding_size: None,
},
count: None,
}],
},
);
// Create the bind group
let time_bind_group = device.create_bind_group(&wgpu::BindGroupDescriptor {
label: Some("Time Bind Group"),
layout: &bind_group_layout,
entries: &[wgpu::BindGroupEntry {
binding: 0,
resource: time_buffer.as_entire_binding(),
}],
});
// Update the pipeline layout to include our bind group
let pipeline_layout = device.create_pipeline_layout(
&wgpu::PipelineLayoutDescriptor {
label: Some("Animated Pipeline Layout"),
bind_group_layouts: &[&bind_group_layout],
push_constant_ranges: &[],
},
);
}
Step 3: update the buffer each frame
In your render function, before beginning the render pass, write the current time to the buffer:
#![allow(unused)]
fn main() {
let elapsed = self.start_time.elapsed().as_secs_f32();
gpu.queue.write_buffer(&gpu.time_buffer, 0, bytemuck::cast_slice(&[elapsed]));
}
queue.write_buffer copies data from CPU memory into the GPU buffer. This is the simplest way to update a uniform each frame.
Step 4: bind the group in the render pass
Inside your render pass, after setting the pipeline:
#![allow(unused)]
fn main() {
pass.set_pipeline(&gpu.pipeline);
pass.set_bind_group(0, &gpu.time_bind_group, &[]); // group 0
pass.set_vertex_buffer(0, gpu.vertex_buffer.slice(..));
pass.draw(0..3, 0..1);
}
The set_bind_group(0, ...) call makes the time buffer available to the shader as @group(0) @binding(0).
Expected result
When you run the program, you should see the coloured triangle smoothly rotating around the centre of the window. The triangle completes one full rotation every 2π seconds (approximately 6.28 seconds).
Understanding the flow
Each frame:
┌─────────────────────────────────────────────────────────────┐
│ │
│ CPU: elapsed = Instant::now() - start │
│ queue.write_buffer(time_buffer, elapsed) │
│ │
│ GPU: time uniform ← time_buffer │
│ for each vertex: │
│ rotated_pos = rotate(vertex.pos, time) │
│ output clip_position = rotated_pos │
│ │
│ Result: triangle rotates smoothly │
└─────────────────────────────────────────────────────────────┘
Challenge: instead of (or in addition to) rotating, try making the triangle pulse in size using sin(time) as a scale factor. Or make it bounce by adding sin(time) * 0.3 to the y position.
Key takeaway: uniform buffers let you pass per-frame data (time, matrices, parameters) to shaders. You create a buffer, describe its layout in a bind group, bind it during the render pass, and access it in WGSL via @group(n) @binding(n). This is how you make shaders dynamic.
Part 4 — Textures and Samplers
11. Texture coordinates (UVs), texture creation, sampler config
Solid colours and gradients are a start, but most real-world graphics use textures — images mapped onto surfaces. This section explains how textures work, how UV coordinates map image data onto geometry, and how samplers control the lookup.
What are UV coordinates?
UV coordinates (also called texture coordinates) describe where on a texture each vertex should sample from. They range from (0, 0) at the top-left of the texture to (1, 1) at the bottom-right:
Texture UV Space
================
(0,0)────────────────(1,0)
│ │
│ ┌──────────┐ │
│ │ │ │
│ │ image │ │
│ │ data │ │
│ │ │ │
│ └──────────┘ │
│ │
(0,1)────────────────(1,1)
Note: in wgpu/WebGPU, (0,0) is the top-left
and v increases downward.
Each vertex carries a UV coordinate. When the GPU rasterises a triangle, it interpolates these UVs across the surface (just like it interpolates colours). The fragment shader then uses the interpolated UV to look up a colour from the texture.
Quad with UV mapping
====================
Vertex 0: pos=(-0.5, 0.5) uv=(0, 0) ← top-left
Vertex 1: pos=( 0.5, 0.5) uv=(1, 0) ← top-right
Vertex 2: pos=(-0.5,-0.5) uv=(0, 1) ← bottom-left
Vertex 3: pos=( 0.5,-0.5) uv=(1, 1) ← bottom-right
The full texture maps exactly onto the quad.
Creating a texture in wgpu
To use a texture, you need to:
- Create the texture on the GPU
- Upload the image data
- Create a texture view for accessing it in shaders
- Create a sampler that controls how texels are looked up
#![allow(unused)]
fn main() {
// Step 1: Create the texture
let texture = device.create_texture(&wgpu::TextureDescriptor {
label: Some("My Texture"),
size: wgpu::Extent3d {
width: img_width,
height: img_height,
depth_or_array_layers: 1,
},
mip_level_count: 1,
sample_count: 1,
dimension: wgpu::TextureDimension::D2,
format: wgpu::TextureFormat::Rgba8UnormSrgb,
usage: wgpu::TextureUsages::TEXTURE_BINDING | wgpu::TextureUsages::COPY_DST,
view_formats: &[],
});
// Step 2: Upload the pixel data
queue.write_texture(
wgpu::TexelCopyTextureInfo {
texture: &texture,
mip_level: 0,
origin: wgpu::Origin3d::ZERO,
aspect: wgpu::TextureAspect::All,
},
&rgba_bytes, // &[u8] of RGBA pixel data
wgpu::TexelCopyBufferLayout {
offset: 0,
bytes_per_row: Some(4 * img_width),
rows_per_image: Some(img_height),
},
wgpu::Extent3d {
width: img_width,
height: img_height,
depth_or_array_layers: 1,
},
);
// Step 3: Create a view
let texture_view = texture.create_view(&Default::default());
}
Sampler configuration
A sampler controls how the GPU looks up texels (texture pixels) when the UV does not land exactly on a texel centre. There are two key settings:
Filtering controls how texels are blended:
- Nearest: picks the closest texel (pixelated look, fast)
- Linear: blends the four nearest texels (smooth look)
Nearest filtering Linear filtering
================== =================
┌───┬───┬───┐ ┌───┬───┬───┐
│ A │ B │ │ │ A │ B │ │
├───┼───┼───┤ ├───┼╌╌╌┼───┤
│ C │ D │ │ │ C │avg│ │
├───┼───┼───┤ ├───┼───┼───┤
│ │ │ │ │ │ │ │
└───┴───┴───┘ └───┴───┴───┘
Nearest: picks one Linear: blends A,B,C,D
texel (e.g., A) based on distance
Address mode (wrapping) controls what happens when UVs go outside the 0-1 range:
- ClampToEdge: UVs outside 0-1 use the edge colour
- Repeat: the texture tiles
- MirrorRepeat: the texture tiles, flipping every other repetition
#![allow(unused)]
fn main() {
let sampler = device.create_sampler(&wgpu::SamplerDescriptor {
label: Some("Texture Sampler"),
address_mode_u: wgpu::AddressMode::ClampToEdge,
address_mode_v: wgpu::AddressMode::ClampToEdge,
address_mode_w: wgpu::AddressMode::ClampToEdge,
mag_filter: wgpu::FilterMode::Linear,
min_filter: wgpu::FilterMode::Linear,
mipmap_filter: wgpu::FilterMode::Nearest,
..Default::default()
});
}
Bind groups for textures
Textures and samplers are bound to shaders using bind groups, just like uniform buffers:
#![allow(unused)]
fn main() {
let bind_group_layout = device.create_bind_group_layout(
&wgpu::BindGroupLayoutDescriptor {
label: Some("Texture Bind Group Layout"),
entries: &[
// The texture
wgpu::BindGroupLayoutEntry {
binding: 0,
visibility: wgpu::ShaderStages::FRAGMENT,
ty: wgpu::BindingType::Texture {
sample_type: wgpu::TextureSampleType::Float { filterable: true },
view_dimension: wgpu::TextureViewDimension::D2,
multisampled: false,
},
count: None,
},
// The sampler
wgpu::BindGroupLayoutEntry {
binding: 1,
visibility: wgpu::ShaderStages::FRAGMENT,
ty: wgpu::BindingType::Sampler(
wgpu::SamplerBindingType::Filtering,
),
count: None,
},
],
},
);
}
In WGSL, you access them like this:
@group(0) @binding(0)
var t_diffuse: texture_2d<f32>;
@group(0) @binding(1)
var s_diffuse: sampler;
@fragment
fn fs_main(in: VertexOutput) -> @location(0) vec4<f32> {
return textureSample(t_diffuse, s_diffuse, in.uv);
}
The textureSample function takes a texture, a sampler, and UV coordinates, and returns the sampled colour.
Key takeaway: textures are images stored on the GPU. UV coordinates map texture space onto geometry. Samplers control filtering (nearest vs linear) and wrapping behaviour. The fragment shader uses textureSample to look up a colour from the texture at interpolated UV coordinates.
12. Exercise 4: render a textured quad
In this exercise you will draw a rectangle (two triangles forming a quad) with a texture mapped onto it. You will create a procedural checkerboard texture in code rather than loading an image file, keeping the exercise self-contained.
Step 1: add dependencies
We do not need an image loading crate for this exercise since we generate the texture procedurally. The same Cargo.toml from Exercise 2 works, with bytemuck already included.
Step 2: the WGSL shader
struct VertexInput {
@location(0) position: vec3<f32>,
@location(1) uv: vec2<f32>,
}
struct VertexOutput {
@builtin(position) clip_position: vec4<f32>,
@location(0) uv: vec2<f32>,
}
@vertex
fn vs_main(in: VertexInput) -> VertexOutput {
var out: VertexOutput;
out.clip_position = vec4f(in.position, 1.0);
out.uv = in.uv;
return out;
}
@group(0) @binding(0)
var t_texture: texture_2d<f32>;
@group(0) @binding(1)
var s_sampler: sampler;
@fragment
fn fs_main(in: VertexOutput) -> @location(0) vec4<f32> {
return textureSample(t_texture, s_sampler, in.uv);
}
Note how the vertex now carries a vec2<f32> UV coordinate instead of a colour. The fragment shader samples the texture at the interpolated UV.
Step 3: vertex data for a quad
A quad is two triangles. We define six vertices (or four vertices with an index buffer — we will use six for simplicity):
#![allow(unused)]
fn main() {
#[repr(C)]
#[derive(Copy, Clone, bytemuck::Pod, bytemuck::Zeroable)]
struct Vertex {
position: [f32; 3],
uv: [f32; 2],
}
// Two triangles forming a quad
const VERTICES: &[Vertex] = &[
// Triangle 1 (top-left half)
Vertex { position: [-0.5, 0.5, 0.0], uv: [0.0, 0.0] }, // top-left
Vertex { position: [-0.5, -0.5, 0.0], uv: [0.0, 1.0] }, // bottom-left
Vertex { position: [ 0.5, 0.5, 0.0], uv: [1.0, 0.0] }, // top-right
// Triangle 2 (bottom-right half)
Vertex { position: [ 0.5, 0.5, 0.0], uv: [1.0, 0.0] }, // top-right
Vertex { position: [-0.5, -0.5, 0.0], uv: [0.0, 1.0] }, // bottom-left
Vertex { position: [ 0.5, -0.5, 0.0], uv: [1.0, 1.0] }, // bottom-right
];
}
Step 4: generate a procedural checkerboard texture
#![allow(unused)]
fn main() {
/// Generate a `width` x `height` RGBA checkerboard with square cells
/// of `cell_size` pixels.
fn make_checkerboard(width: u32, height: u32, cell_size: u32) -> Vec<u8> {
let mut pixels = Vec::with_capacity((width * height * 4) as usize);
for y in 0..height {
for x in 0..width {
let is_white = ((x / cell_size) + (y / cell_size)) % 2 == 0;
let val = if is_white { 255u8 } else { 80u8 };
pixels.push(val); // R
pixels.push(val); // G
pixels.push(val); // B
pixels.push(255); // A
}
}
pixels
}
}
Call it with make_checkerboard(256, 256, 32) to get a 256x256 texture with 32-pixel checker cells.
Step 5: create the texture, view, and sampler
#![allow(unused)]
fn main() {
let tex_size = 256u32;
let tex_data = make_checkerboard(tex_size, tex_size, 32);
let texture = device.create_texture(&wgpu::TextureDescriptor {
label: Some("Checkerboard Texture"),
size: wgpu::Extent3d {
width: tex_size,
height: tex_size,
depth_or_array_layers: 1,
},
mip_level_count: 1,
sample_count: 1,
dimension: wgpu::TextureDimension::D2,
format: wgpu::TextureFormat::Rgba8UnormSrgb,
usage: wgpu::TextureUsages::TEXTURE_BINDING | wgpu::TextureUsages::COPY_DST,
view_formats: &[],
});
queue.write_texture(
wgpu::TexelCopyTextureInfo {
texture: &texture,
mip_level: 0,
origin: wgpu::Origin3d::ZERO,
aspect: wgpu::TextureAspect::All,
},
&tex_data,
wgpu::TexelCopyBufferLayout {
offset: 0,
bytes_per_row: Some(4 * tex_size),
rows_per_image: Some(tex_size),
},
wgpu::Extent3d {
width: tex_size,
height: tex_size,
depth_or_array_layers: 1,
},
);
let texture_view = texture.create_view(&Default::default());
let sampler = device.create_sampler(&wgpu::SamplerDescriptor {
label: Some("Checkerboard Sampler"),
mag_filter: wgpu::FilterMode::Nearest, // crisp pixels for checkerboard
min_filter: wgpu::FilterMode::Nearest,
..Default::default()
});
}
Step 6: bind group setup
#![allow(unused)]
fn main() {
let bind_group_layout = device.create_bind_group_layout(
&wgpu::BindGroupLayoutDescriptor {
label: Some("Texture Bind Group Layout"),
entries: &[
wgpu::BindGroupLayoutEntry {
binding: 0,
visibility: wgpu::ShaderStages::FRAGMENT,
ty: wgpu::BindingType::Texture {
sample_type: wgpu::TextureSampleType::Float { filterable: true },
view_dimension: wgpu::TextureViewDimension::D2,
multisampled: false,
},
count: None,
},
wgpu::BindGroupLayoutEntry {
binding: 1,
visibility: wgpu::ShaderStages::FRAGMENT,
ty: wgpu::BindingType::Sampler(wgpu::SamplerBindingType::Filtering),
count: None,
},
],
},
);
let bind_group = device.create_bind_group(&wgpu::BindGroupDescriptor {
label: Some("Texture Bind Group"),
layout: &bind_group_layout,
entries: &[
wgpu::BindGroupEntry {
binding: 0,
resource: wgpu::BindingResource::TextureView(&texture_view),
},
wgpu::BindGroupEntry {
binding: 1,
resource: wgpu::BindingResource::Sampler(&sampler),
},
],
});
}
Remember to include &bind_group_layout in your pipeline layout’s bind_group_layouts array, and update the vertex buffer layout to match the new Vertex struct (position: Float32x3 at offset 0, uv: Float32x2 at offset 12).
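For reference, a minimal sketch of that layout; the offsets and formats follow directly from the Vertex struct, and the shader locations match @location(0) and @location(1) in the WGSL above:
#![allow(unused)]
fn main() {
let vertex_layout = wgpu::VertexBufferLayout {
    array_stride: std::mem::size_of::<Vertex>() as wgpu::BufferAddress, // 20 bytes
    step_mode: wgpu::VertexStepMode::Vertex,
    attributes: &[
        wgpu::VertexAttribute {
            offset: 0,
            shader_location: 0, // @location(0) position
            format: wgpu::VertexFormat::Float32x3,
        },
        wgpu::VertexAttribute {
            offset: 12, // three f32s of position = 12 bytes
            shader_location: 1, // @location(1) uv
            format: wgpu::VertexFormat::Float32x2,
        },
    ],
};
}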
Step 7: draw the quad
In your render pass:
#![allow(unused)]
fn main() {
pass.set_pipeline(&gpu.pipeline);
pass.set_bind_group(0, &gpu.bind_group, &[]);
pass.set_vertex_buffer(0, gpu.vertex_buffer.slice(..));
pass.draw(0..6, 0..1); // 6 vertices = 2 triangles = 1 quad
}
Expected result
You should see a rectangle in the centre of the window showing a black-and-white checkerboard pattern. The texture is mapped so that the full checkerboard fills the quad exactly.
Challenge: try changing the sampler’s mag_filter from Nearest to Linear and see how the checkerboard edges become blurred when the quad is large. Then try setting address_mode_u and address_mode_v to Repeat, and change the UVs to go from 0 to 3 — you will see the checkerboard tile three times across the quad.
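If you try the tiling variant, the sampler change is a small tweak to Step 5; a sketch using wgpu's standard SamplerDescriptor fields:
#![allow(unused)]
fn main() {
let sampler = device.create_sampler(&wgpu::SamplerDescriptor {
    label: Some("Tiling Sampler"),
    address_mode_u: wgpu::AddressMode::Repeat, // tile horizontally
    address_mode_v: wgpu::AddressMode::Repeat, // tile vertically
    mag_filter: wgpu::FilterMode::Linear,      // soft edges when magnified
    min_filter: wgpu::FilterMode::Linear,
    ..Default::default()
});
}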
Key takeaway: texturing involves creating a texture from pixel data, configuring a sampler for filtering and wrapping, binding both via a bind group, and sampling in the fragment shader using interpolated UV coordinates. This same pattern applies whether your texture is a checkerboard, a photograph, or a render target from a previous pass.
Part 5 — Compute Shaders
13. Compute pipelines: dispatching work groups
Compute shaders break free from the graphics pipeline entirely. There are no vertices, no triangles, no pixels — just raw parallel computation. This makes them ideal for physics simulations, image processing, data transformations, and any task that benefits from GPU parallelism.
Graphics pipeline vs compute pipeline
Graphics Pipeline Compute Pipeline
================= ================
Vertices Dispatch(x, y, z)
│ │
▼ ▼
Vertex Shader Compute Shader
│ │
▼ ▼
Rasterisation Storage buffers /
│ textures (output)
▼
Fragment Shader
│
▼
Framebuffer (pixels)
Produces images Produces data
With compute shaders, you do not set up vertex buffers, render passes, or colour attachments. Instead, you dispatch work and let the compute shader read/write storage buffers or textures directly.
Work groups and invocations
When you dispatch a compute shader, you specify a 3D grid of work groups. Each work group contains a fixed number of invocations (threads), defined by @workgroup_size in the shader.
Dispatch and Work Groups
========================
dispatch(4, 3, 1) ← 4 x 3 x 1 = 12 work groups
┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐
│ WG │ │ WG │ │ WG │ │ WG │ row 0
│(0,0)│ │(1,0)│ │(2,0)│ │(3,0)│
└─────┘ └─────┘ └─────┘ └─────┘
┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐
│ WG │ │ WG │ │ WG │ │ WG │ row 1
│(0,1)│ │(1,1)│ │(2,1)│ │(3,1)│
└─────┘ └─────┘ └─────┘ └─────┘
┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐
│ WG │ │ WG │ │ WG │ │ WG │ row 2
│(0,2)│ │(1,2)│ │(2,2)│ │(3,2)│
└─────┘ └─────┘ └─────┘ └─────┘
Inside each work group (e.g., @workgroup_size(8, 8, 1)):
┌─┬─┬─┬─┬─┬─┬─┬─┐
│·│·│·│·│·│·│·│·│ 8 invocations wide
├─┼─┼─┼─┼─┼─┼─┼─┤
│·│·│·│·│·│·│·│·│ x 8 invocations tall
├─┼─┼─┼─┼─┼─┼─┼─┤
│·│·│·│·│·│·│·│·│ = 64 invocations per
├─┼─┼─┼─┼─┼─┼─┼─┤ work group
│·│·│·│·│·│·│·│·│
├─┼─┼─┼─┼─┼─┼─┼─┤
│·│·│·│·│·│·│·│·│
├─┼─┼─┼─┼─┼─┼─┼─┤
│·│·│·│·│·│·│·│·│
├─┼─┼─┼─┼─┼─┼─┼─┤
│·│·│·│·│·│·│·│·│
├─┼─┼─┼─┼─┼─┼─┼─┤
│·│·│·│·│·│·│·│·│
└─┴─┴─┴─┴─┴─┴─┴─┘
Total invocations = 12 work groups x 64 = 768 threads
Built-in IDs
Each invocation knows its position in the grid via built-in variables:
| Built-in | Type | Meaning |
|---|---|---|
| global_invocation_id | vec3<u32> | Unique ID across the entire dispatch |
| local_invocation_id | vec3<u32> | ID within the work group (0 to workgroup_size-1) |
| workgroup_id | vec3<u32> | Which work group this invocation belongs to |
| num_workgroups | vec3<u32> | Total number of work groups dispatched |
global_invocation_id is the most commonly used — it gives each thread a unique index.
@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) id: vec3<u32>) {
let index = id.x;
// Process element at `index`
}
Choosing workgroup_size
The @workgroup_size(x, y, z) declaration sets how many invocations run per work group. Guidelines:
- Total invocations per group (x * y * z) should be a multiple of 32 or 64 for best performance (matching GPU warp/wavefront size)
- Common choices: @workgroup_size(64), @workgroup_size(256), @workgroup_size(8, 8) for 2D, @workgroup_size(4, 4, 4) for 3D
- The maximum total varies by GPU but is typically 256 or 1024
Creating a compute pipeline in Rust
#![allow(unused)]
fn main() {
let compute_shader = device.create_shader_module(wgpu::ShaderModuleDescriptor {
label: Some("Compute Shader"),
source: wgpu::ShaderSource::Wgsl(shader_source.into()),
});
let compute_pipeline = device.create_compute_pipeline(
&wgpu::ComputePipelineDescriptor {
label: Some("Compute Pipeline"),
layout: Some(&pipeline_layout),
module: &compute_shader,
entry_point: Some("main"),
compilation_options: Default::default(),
cache: None,
},
);
}
Dispatching work
Instead of a render pass, you use a compute pass:
#![allow(unused)]
fn main() {
let mut encoder = device.create_command_encoder(&Default::default());
{
let mut compute_pass = encoder.begin_compute_pass(&Default::default());
compute_pass.set_pipeline(&compute_pipeline);
compute_pass.set_bind_group(0, &bind_group, &[]);
compute_pass.dispatch_workgroups(num_groups_x, num_groups_y, num_groups_z);
}
queue.submit(std::iter::once(encoder.finish()));
}
If you have 1024 elements and your workgroup_size is 64, you dispatch 1024 / 64 = 16 work groups: dispatch_workgroups(16, 1, 1).
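The same round-up logic generalises to 2D. A sketch for image-sized work with @workgroup_size(8, 8), assuming a hypothetical 1920x1080 image:
#![allow(unused)]
fn main() {
// One invocation per pixel; round each dimension up so edge pixels
// are covered (the shader should bounds-check against the real size).
let (width, height) = (1920u32, 1080u32); // hypothetical image size
compute_pass.dispatch_workgroups(width.div_ceil(8), height.div_ceil(8), 1);
}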
Key takeaway: compute shaders run outside the graphics pipeline. You dispatch a 3D grid of work groups, each containing a fixed number of invocations. Every invocation gets a unique global_invocation_id to determine which data element to process. This is how you harness the GPU’s parallelism for general-purpose computation.
14. Storage buffers and read/write access from WGSL
Compute shaders need to read input data and write output data. Storage buffers are the primary mechanism for this. Unlike uniform buffers (which are small and read-only), storage buffers can be large and support both reading and writing.
Storage buffers vs uniform buffers
| Feature | Uniform Buffer | Storage Buffer |
|---|---|---|
| Max size | ~64 KB (varies) | Hundreds of MB |
| Access | Read-only | Read-only or read-write |
| Speed | Faster (cached aggressively) | Slightly slower |
| Use case | Small, per-frame constants | Large data arrays |
Use uniform buffers for things like transformation matrices, time values, and camera parameters. Use storage buffers for arrays of particles, pixels, mesh data, or any large dataset.
Declaring storage buffers in WGSL
// Read-only storage buffer
@group(0) @binding(0)
var<storage, read> input: array<f32>;
// Read-write storage buffer
@group(0) @binding(1)
var<storage, read_write> output: array<f32>;
You can also use structs:
struct Particle {
position: vec2<f32>,
velocity: vec2<f32>,
}
@group(0) @binding(0)
var<storage, read_write> particles: array<Particle>;
Accessing storage buffer data
Storage buffers behave like regular arrays in WGSL:
// The timestep comes from a uniform buffer (declared here so the
// snippet is complete):
@group(0) @binding(1)
var<uniform> delta_time: f32;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) id: vec3<u32>) {
    let i = id.x;
    // Bounds check — important when dispatch size
    // does not evenly divide the data
    if i >= arrayLength(&particles) {
        return;
    }
    // Read
    let pos = particles[i].position;
    let vel = particles[i].velocity;
    // Compute
    let new_pos = pos + vel * delta_time;
    // Write back
    particles[i].position = new_pos;
}
The arrayLength(&buffer) function returns the number of elements in a runtime-sized array. Always use it for bounds checking — if your dispatch creates more invocations than data elements, the extra threads must bail out early.
Creating storage buffers in Rust
#![allow(unused)]
fn main() {
use wgpu::util::DeviceExt; // brings create_buffer_init into scope

// Create a storage buffer from initial data
let storage_buffer = device.create_buffer_init(&wgpu::util::BufferInitDescriptor {
label: Some("Particle Buffer"),
contents: bytemuck::cast_slice(&initial_particles),
usage: wgpu::BufferUsages::STORAGE
| wgpu::BufferUsages::COPY_SRC // to read back to CPU
| wgpu::BufferUsages::COPY_DST, // to write from CPU
});
}
The STORAGE usage flag is required. Add COPY_SRC if you want to read data back to the CPU, and COPY_DST if you want to upload data from the CPU.
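For example, re-uploading data from the CPU (the COPY_DST path) is a one-liner, assuming new_particles is a slice of Pod values:
#![allow(unused)]
fn main() {
// Overwrite the buffer contents from the CPU (requires COPY_DST).
queue.write_buffer(&storage_buffer, 0, bytemuck::cast_slice(&new_particles));
}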
Bind group layout for storage buffers
#![allow(unused)]
fn main() {
wgpu::BindGroupLayoutEntry {
binding: 0,
visibility: wgpu::ShaderStages::COMPUTE,
ty: wgpu::BindingType::Buffer {
ty: wgpu::BufferBindingType::Storage {
read_only: false, // true for read-only access
},
has_dynamic_offset: false,
min_binding_size: None,
},
count: None,
}
}
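The matching bind group entry is simpler than the layout; a minimal sketch:
#![allow(unused)]
fn main() {
wgpu::BindGroupEntry {
    binding: 0,
    // Bind the whole buffer; use BindingResource::Buffer for a sub-range.
    resource: storage_buffer.as_entire_binding(),
}
}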
Reading results back to the CPU
GPU buffers are not directly accessible from CPU memory. To read results back, you copy to a staging buffer with MAP_READ usage:
#![allow(unused)]
fn main() {
// Create a staging buffer
let staging_buffer = device.create_buffer(&wgpu::BufferDescriptor {
label: Some("Staging Buffer"),
size: storage_buffer.size(),
usage: wgpu::BufferUsages::MAP_READ | wgpu::BufferUsages::COPY_DST,
mapped_at_creation: false,
});
// Copy from storage to staging
encoder.copy_buffer_to_buffer(
&storage_buffer, 0,
&staging_buffer, 0,
storage_buffer.size(),
);
queue.submit(std::iter::once(encoder.finish()));
// Map the staging buffer and read the data
let slice = staging_buffer.slice(..);
slice.map_async(wgpu::MapMode::Read, |_| {});
device.poll(wgpu::Maintain::Wait);
let data = slice.get_mapped_range();
let result: &[Particle] = bytemuck::cast_slice(&data);
// Use the result...
drop(data);
staging_buffer.unmap();
}
Memory considerations
- Workgroup memory: WGSL also supports var<workgroup> for shared memory within a work group. This is very fast but limited in size (typically 16-48 KB).
- Synchronization: within a work group, use workgroupBarrier() to ensure all threads have finished writing before any thread reads shared data. Across work groups, there is no synchronization within a single dispatch — use separate dispatches if you need global barriers.
var<workgroup> shared_data: array<f32, 64>;
@compute @workgroup_size(64)
fn main(@builtin(local_invocation_id) lid: vec3<u32>) {
shared_data[lid.x] = some_computation();
workgroupBarrier(); // wait for all threads in this group
let neighbour = shared_data[(lid.x + 1u) % 64u];
}
Key takeaway: storage buffers are the workhorse of compute shaders — they hold large arrays that shaders can read and write. Declare them with var<storage, read_write> in WGSL, create them with BufferUsages::STORAGE in Rust, and always bounds-check with arrayLength. To read results back to the CPU, copy to a staging buffer with MAP_READ.
15. Exercise 5: GPU-accelerate a particle simulation
In this exercise you will build a simple particle system where thousands of particles are updated each frame by a compute shader. Particles will have positions and velocities, bounce off the edges of the screen, and be rendered as points.
Overview
The architecture is:
┌──────────────┐ ┌──────────────────┐ ┌─────────────┐
│ CPU: init │────►│ GPU: compute pass │────►│ GPU: render │
│ particles │ │ update positions │ │ pass: draw │
│ once │ │ each frame │ │ as points │
└──────────────┘ └──────────────────┘ └─────────────┘
│
▼
                 Storage buffer
                 (read/write by
                 compute shader,
                 read as storage
                 by the render
                 vertex shader)
The same buffer serves double duty: the compute shader writes updated positions into it, and the render pass’s vertex shader reads it back as a read-only storage buffer, indexed by the vertex index.
Step 1: particle data structure
#![allow(unused)]
fn main() {
#[repr(C)]
#[derive(Copy, Clone, bytemuck::Pod, bytemuck::Zeroable)]
struct Particle {
position: [f32; 2],
velocity: [f32; 2],
}
}
Step 2: initialise particles
#![allow(unused)]
fn main() {
use rand::Rng;
fn create_particles(count: usize) -> Vec<Particle> {
let mut rng = rand::rng();
(0..count)
.map(|_| Particle {
position: [
rng.random_range(-1.0f32..1.0),
rng.random_range(-1.0f32..1.0),
],
velocity: [
rng.random_range(-0.5f32..0.5),
rng.random_range(-0.5f32..0.5),
],
})
.collect()
}
}
Add rand = "0.9" to your Cargo.toml.
Step 3: the compute shader (WGSL)
struct Particle {
position: vec2<f32>,
velocity: vec2<f32>,
}
@group(0) @binding(0)
var<storage, read_write> particles: array<Particle>;
@group(0) @binding(1)
var<uniform> delta_time: f32;
@compute @workgroup_size(64)
fn cs_main(@builtin(global_invocation_id) id: vec3<u32>) {
let i = id.x;
if i >= arrayLength(&particles) {
return;
}
var p = particles[i];
// Update position
p.position = p.position + p.velocity * delta_time;
// Bounce off edges
if p.position.x < -1.0 || p.position.x > 1.0 {
p.velocity.x = -p.velocity.x;
p.position.x = clamp(p.position.x, -1.0, 1.0);
}
if p.position.y < -1.0 || p.position.y > 1.0 {
p.velocity.y = -p.velocity.y;
p.position.y = clamp(p.position.y, -1.0, 1.0);
}
particles[i] = p;
}
Step 4: the render shader (WGSL)
To render particles as points, the vertex shader reads the position from the storage buffer. Each particle becomes one point:
struct RenderOutput {
    @builtin(position) pos: vec4<f32>,
}

// We read the same particle buffer as a storage buffer for rendering.
// (The Particle struct must be declared in this module as well.)
@group(0) @binding(0)
var<storage, read> render_particles: array<Particle>;

@vertex
fn vs_render(@builtin(vertex_index) vi: u32) -> RenderOutput {
    var out: RenderOutput;
    let p = render_particles[vi];
    out.pos = vec4f(p.position, 0.0, 1.0);
    return out;
}
@fragment
fn fs_render() -> @location(0) vec4<f32> {
return vec4f(0.2, 0.8, 0.4, 1.0); // green particles
}
Note: WGSL has no point-size builtin, so with wgpu::PrimitiveTopology::PointList (set in the render pipeline’s primitive state) each particle is drawn as a single pixel. To render larger particles, draw a small quad per particle using instancing.
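On the Rust side, the instanced alternative changes only the draw call; the vertex shader would then build a unit quad per particle from @builtin(instance_index). A sketch:
#![allow(unused)]
fn main() {
// Hypothetical instanced draw: six quad vertices, one instance per particle.
rpass.draw(0..6, 0..num_particles);
}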
Step 5: buffer creation
#![allow(unused)]
fn main() {
let num_particles = 10_000u32;
let particles = create_particles(num_particles as usize);
let particle_buffer = device.create_buffer_init(&wgpu::util::BufferInitDescriptor {
label: Some("Particle Buffer"),
contents: bytemuck::cast_slice(&particles),
usage: wgpu::BufferUsages::STORAGE | wgpu::BufferUsages::VERTEX,
});
let dt_buffer = device.create_buffer(&wgpu::BufferDescriptor {
label: Some("Delta Time Buffer"),
size: 4,
usage: wgpu::BufferUsages::UNIFORM | wgpu::BufferUsages::COPY_DST,
mapped_at_creation: false,
});
}
The particle buffer’s STORAGE usage covers both passes here: the compute pipeline binds it read-write, and the render pipeline’s vertex shader binds the same buffer as read-only storage. The VERTEX flag would only be needed if you instead bound it with set_vertex_buffer and a vertex buffer layout.
Step 6: frame loop
Each frame:
- Calculate delta time
- Write delta time to the uniform buffer
- Run the compute pass to update particles
- Run the render pass to draw particles
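Steps 1 and 2 are plain CPU work; a minimal sketch, assuming a last_frame: std::time::Instant field kept between frames:
#![allow(unused)]
fn main() {
// Steps 1 and 2: measure delta time and upload it to the uniform buffer.
let now = std::time::Instant::now();
let dt = (now - last_frame).as_secs_f32();
last_frame = now;
queue.write_buffer(&dt_buffer, 0, bytemuck::bytes_of(&dt));
}
The compute and render passes (steps 3 and 4) then look like this: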
#![allow(unused)]
fn main() {
// Compute pass
{
let mut cpass = encoder.begin_compute_pass(&Default::default());
cpass.set_pipeline(&compute_pipeline);
cpass.set_bind_group(0, &compute_bind_group, &[]);
let num_workgroups = (num_particles + 63) / 64; // round up
cpass.dispatch_workgroups(num_workgroups, 1, 1);
}
// Render pass
{
let mut rpass = encoder.begin_render_pass(&wgpu::RenderPassDescriptor {
color_attachments: &[Some(wgpu::RenderPassColorAttachment {
view: &view,
resolve_target: None,
ops: wgpu::Operations {
load: wgpu::LoadOp::Clear(wgpu::Color::BLACK),
store: wgpu::StoreOp::Store,
},
})],
..Default::default()
});
rpass.set_pipeline(&render_pipeline);
rpass.set_bind_group(0, &render_bind_group, &[]);
rpass.draw(0..num_particles, 0..1);
}
}
Note how dispatch_workgroups rounds up: (10000 + 63) / 64 = 157 work groups, giving 10048 invocations. The bounds check in the shader (if i >= arrayLength(&particles)) prevents the extra 48 threads from accessing out-of-bounds memory.
Expected result
You should see thousands of small green particles bouncing around the window, all updated in parallel on the GPU. With 10,000 particles at 60 FPS, the GPU handles 600,000 particle updates per second with ease — and it could handle millions.
Challenge: add a gravity force (p.velocity.y -= 9.8 * delta_time) and watch the particles fall and bounce off the bottom edge. Or add mouse interaction — pass the mouse position as a uniform and apply a force toward or away from the cursor.
Key takeaway: compute shaders can update large datasets in parallel every frame. By sharing one STORAGE buffer between the compute pipeline (read-write) and the render pipeline (read-only), you can update data in a compute pass and render it in a render pass without copying between buffers. This compute-then-render pattern is the foundation of GPU-driven simulations.
Part 6 — Going Further
16. Post-processing effects (bloom, blur): conceptual overview
So far, you have rendered directly to the screen. But many visual effects require multi-pass rendering: render the scene to an intermediate texture first, then process that texture in subsequent passes before displaying the final result. This is called post-processing.
Render-to-texture
Instead of targeting the swap chain texture directly, you create an off-screen texture and render to it:
Render-to-Texture
==================
Pass 1: Render scene Pass 2: Post-process
┌──────────────────┐ ┌──────────────────┐
│ │ │ │
│ Scene geometry │──render to──► │ Full-screen │──render to──► Screen
│ (3D objects) │ off-screen │ quad sampling │ swap chain
│ │ texture │ the texture │ texture
└──────────────────┘ └──────────────────┘
In wgpu, this means creating a wgpu::Texture with RENDER_ATTACHMENT | TEXTURE_BINDING usage. You render to it in pass 1, then sample from it in pass 2.
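A sketch of creating such a texture, with a hypothetical fixed size and a format assumed to match your render pipeline:
#![allow(unused)]
fn main() {
let scene_target = device.create_texture(&wgpu::TextureDescriptor {
    label: Some("Scene Target"),
    size: wgpu::Extent3d {
        width: 1280, // hypothetical; usually the window's current size
        height: 720,
        depth_or_array_layers: 1,
    },
    mip_level_count: 1,
    sample_count: 1,
    dimension: wgpu::TextureDimension::D2,
    format: wgpu::TextureFormat::Rgba8UnormSrgb, // assumed; match your pipeline
    usage: wgpu::TextureUsages::RENDER_ATTACHMENT // pass 1 renders into it
        | wgpu::TextureUsages::TEXTURE_BINDING,   // pass 2 samples it
    view_formats: &[],
});
}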
Bloom effect
Bloom makes bright areas of an image glow, simulating how real cameras and eyes perceive very bright light. The algorithm has three stages:
Bloom Pipeline
==============
Scene ──► [1. Threshold] ──► [2. Blur] ──► [3. Composite] ──► Final
Extract Gaussian Add blurred
bright blur the bright areas
pixels result back onto
only the original
Stage 1 — Threshold: a fragment shader that outputs only pixels brighter than a threshold, and black for everything else.
@fragment
fn threshold(in: FullscreenInput) -> @location(0) vec4<f32> {
let color = textureSample(scene_texture, samp, in.uv);
let brightness = dot(color.rgb, vec3f(0.2126, 0.7152, 0.0722));
if brightness > 0.8 {
return color;
}
return vec4f(0.0, 0.0, 0.0, 1.0);
}
The dot with (0.2126, 0.7152, 0.0722) computes perceptual luminance — the human eye is most sensitive to green, then red, then blue.
Stage 2 — Gaussian blur: blur the thresholded image so bright spots become soft glows. Gaussian blur is separable — you can split a 2D blur into two 1D passes (horizontal then vertical), which is much faster:
Separable Gaussian Blur
=======================
Bright Horizontal Vertical Blurred
pixels ──► blur pass ──► blur pass ──► result
(1D, left (1D, up
to right) to down)
A 9x9 2D kernel = 81 samples per pixel
Two 9-wide 1D kernels = 18 samples per pixel
Same result, 4.5x faster!
A single-direction blur shader samples several neighbouring texels with Gaussian weights:
@fragment
fn blur_horizontal(in: FullscreenInput) -> @location(0) vec4<f32> {
let texel_size = 1.0 / f32(textureDimensions(source).x);
var result = vec4f(0.0);
// Gaussian weights for a 5-tap kernel (9 texels total with mirroring).
// Declared as `var` because some WGSL implementations reject
// dynamically indexing a `let`-bound array.
var weights = array<f32, 5>(0.227, 0.194, 0.122, 0.054, 0.016);
var offsets = array<f32, 5>(0.0, 1.0, 2.0, 3.0, 4.0);
for (var i = 0u; i < 5u; i = i + 1u) {
let offset = vec2f(offsets[i] * texel_size, 0.0);
result += textureSample(source, samp, in.uv + offset) * weights[i];
if i > 0u {
result += textureSample(source, samp, in.uv - offset) * weights[i];
}
}
return result;
}
Stage 3 — Composite: add the blurred bright areas back onto the original scene:
@fragment
fn composite(in: FullscreenInput) -> @location(0) vec4<f32> {
let scene = textureSample(scene_texture, samp, in.uv);
let bloom = textureSample(bloom_texture, samp, in.uv);
return scene + bloom * bloom_intensity;
}
Other post-processing effects
The render-to-texture pattern enables many effects:
- Colour grading: adjust contrast, saturation, colour curves
- Vignette: darken the edges of the screen
- Chromatic aberration: split RGB channels with slight offsets
- Motion blur: blend the current frame with previous frames
- Depth of field: blur based on distance from a focal point (requires a depth buffer)
- Screen-space ambient occlusion (SSAO): approximate indirect shadows
Each effect is a fragment shader running on a full-screen quad, sampling from the previous pass’s texture.
Key takeaway: post-processing effects are implemented as multi-pass rendering. You render the scene to an off-screen texture, then process it through one or more full-screen fragment shader passes. Bloom is a classic example: threshold bright pixels, blur them with separable Gaussian passes, and composite the glow back onto the original. This pattern is the backbone of modern real-time visual effects.
17. Signed Distance Fields for font rendering
Rendering crisp text at any size and rotation is surprisingly difficult with traditional bitmap fonts. Signed Distance Fields (SDFs) provide an elegant solution that gives resolution-independent, anti-aliased text with a single texture.
The problem with bitmap fonts
A bitmap font is a texture where each character is stored as a grid of pixels:
Bitmap "A" at 32px: Zoomed in (pixelated):
┌────────────┐ ┌──┬──┬──┬──┬──┬──┐
│ ██ │ │ │ │██│██│ │ │
│ █ █ │ │ │██│ │ │██│ │
│ ██████ │ │██│██│██│██│██│██│
│ █ █ │ │██│ │ │ │ │██│
│ █ █ │ │██│ │ │ │ │██│
└────────────┘ └──┴──┴──┴──┴──┴──┘
Looks fine at 32px. Looks blocky at 128px.
If you scale the bitmap up, it becomes pixelated. If you scale it down, details are lost. You would need multiple texture sizes, wasting memory.
What is a Signed Distance Field?
An SDF stores, for each texel, the distance to the nearest edge of the shape. Texels inside the shape have negative distances; texels outside have positive distances. The zero-crossing is the exact edge.
SDF for a circle:
+3 +2 +1 0 -1 -2 -1 0 +1 +2 +3
+2 +1 0 -1 -2 -3 -2 -1 0 +1 +2
+1 0 -1 -2 -3 -4 -3 -2 -1 0 +1
0 -1 -2 -3 -4 -5 -4 -3 -2 -1 0
+1 0 -1 -2 -3 -4 -3 -2 -1 0 +1
+2 +1 0 -1 -2 -3 -2 -1 0 +1 +2
+3 +2 +1 0 -1 -2 -1 0 +1 +2 +3
← 0 is the edge. Negative = inside. Positive = outside.
The key insight is that this distance information encodes the shape at any resolution. To render, you check which side of the edge the sampled distance falls on. In practice, the signed distance is baked into a texture remapped to the [0, 1] range, conventionally with 0.5 at the edge and values above 0.5 inside; the shaders below use that convention.
The smoothstep trick
Hard thresholding (inside vs outside) gives jagged edges. The smoothstep function provides perfect anti-aliasing by creating a smooth transition in a narrow band around the edge:
@fragment
fn sdf_text(in: VertexOutput) -> @location(0) vec4<f32> {
// Sample the SDF texture — value is distance to edge
let distance = textureSample(sdf_texture, samp, in.uv).r;
// smoothstep creates a smooth transition near the edge
// 0.5 is the edge; the range (0.45, 0.55) is the anti-alias band
let alpha = smoothstep(0.45, 0.55, distance);
return vec4f(text_color.rgb, alpha);
}
smoothstep visualised:
alpha
1.0                     ╭───────────────────
                       ╱
                      ╱  ← smooth transition
                     ╱     (anti-aliased edge)
0.0 ────────────────╯
    outside        edge        inside
          0.45      0.5      0.55
The width of the transition band can be adjusted. A narrower band gives sharper text; a wider band gives softer text. You can even compute the band width based on the rate of change of the UV coordinates (using fwidth) to get pixel-perfect anti-aliasing at any scale:
let distance = textureSample(sdf_texture, samp, in.uv).r;
let edge = 0.5;
let aa_width = fwidth(distance) * 0.75;
let alpha = smoothstep(edge - aa_width, edge + aa_width, distance);
Advantages of SDF text
- Resolution-independent: one small texture (e.g., 64x64 per glyph) looks crisp at any display size
- Cheap anti-aliasing: just smoothstep — no multisampling needed
- Effects for free: outlines, drop shadows, and glow are trivial to add by adjusting the distance threshold:
// Outline effect
let outline_alpha = smoothstep(0.35, 0.40, distance); // outer edge of outline
let fill_alpha = smoothstep(0.45, 0.55, distance); // inner fill
let color = mix(outline_color, fill_color, fill_alpha);
let alpha = outline_alpha;
SDF effects by varying the threshold:
┌───────────────────────────────────────┐
│ dist < 0.35   → outside (transparent) │
│ 0.35 to 0.45  → outline               │
│ dist > 0.45   → fill (solid text)     │
└───────────────────────────────────────┘
Generating SDF textures
SDF textures are typically pre-generated offline. Tools include:
- msdfgen: generates multi-channel SDFs for even sharper edges
- Hiero (LibGDX): generates SDF font atlases
- fontdue (Rust crate): can generate SDF glyph bitmaps
The generated SDF texture is a single-channel (greyscale) image where 0.5 represents the edge, values above 0.5 are inside the glyph, and values below 0.5 are outside.
Key takeaway: signed distance fields store the distance to a shape’s edge at each texel. This allows rendering crisp, anti-aliased shapes at any resolution from a small texture. The smoothstep function provides the anti-aliasing, and varying the distance threshold enables outlines, glows, and shadows. SDF-based text rendering is used in game engines, mapping applications, and anywhere resolution-independent text is needed.
18. Resources: Learn WGPU, Shadertoy, The Book of Shaders
This section collects the best resources for continuing your shader programming journey. Each resource approaches the topic from a different angle — use them together for a well-rounded education.
Tutorials and courses
Learn WGPU — the definitive tutorial for wgpu in Rust. It walks through window setup, textures, camera systems, lighting, instancing, and more, with complete working code at each step. If you want to build on the exercises in this course, this is the natural next step.
The Book of Shaders by Patricio Gonzalez Vivo and Jen Lowe — a gentle, visual introduction to fragment shaders. It uses GLSL (not WGSL), but the concepts translate directly: noise functions, patterns, colour mixing, shapes, and animation. The interactive editor lets you experiment in real time. Excellent for building shader intuition.
GPU Gems — NVIDIA’s classic book series (available free online). Covers advanced topics like water rendering, subsurface scattering, shadow techniques, and GPU physics. The techniques are presented in HLSL/GLSL but the algorithms are API-agnostic.
WebGPU Fundamentals — explains WebGPU concepts from the ground up with JavaScript examples. Since wgpu implements the WebGPU spec, the API concepts map directly to Rust. Useful for understanding the “why” behind API design decisions.
Interactive playgrounds
Shadertoy — a web-based shader playground where you write fragment shaders (GLSL) and see results immediately. The community has created incredible effects: raymarched landscapes, fluid simulations, fractal zooms, entire games. Study other people’s shaders to learn techniques — the compact format forces creative solutions. You can port Shadertoy ideas to WGSL in your wgpu projects.
WGSL Playground — Google’s Tour of WGSL. An interactive introduction to the WGSL language with runnable examples. Good for quickly testing WGSL syntax.
Specifications and references
WebGPU Specification — the official W3C specification that wgpu implements. Dense but authoritative. Useful when you need to understand exact behaviour.
WGSL Specification — the complete language specification for WGSL. Reference for built-in functions, types, memory models, and grammar.
wgpu documentation (docs.rs) — Rust API documentation for the wgpu crate. Essential reference for looking up function signatures, enum variants, and descriptor fields.
Advanced topics to explore
Once you are comfortable with the basics covered in this course, here are directions to explore:
- 3D rendering: model-view-projection matrices, depth buffers, camera systems
- Lighting: Phong, Blinn-Phong, physically-based rendering (PBR)
- Shadow mapping: rendering depth from light’s perspective, shadow comparison
- Instancing: drawing thousands of objects efficiently with a single draw call
- Raymarching: rendering 3D scenes using signed distance functions (no triangles)
- Procedural generation: noise functions (Perlin, Simplex) for terrain, textures, and clouds
- Deferred rendering: separating geometry and lighting into different passes
- Skeletal animation: vertex skinning with bone matrices
Community
- wgpu GitHub — the source code, issue tracker, and examples
- WebGPU Matrix channel — real-time chat with the wgpu developers and community
- r/rust_gamedev — Rust game development community on Reddit, where wgpu projects are frequently shared
Key takeaway: shader programming is a vast field. Start with Learn WGPU for Rust-specific guidance, The Book of Shaders for visual intuition, and Shadertoy for inspiration. Keep the WGSL spec and wgpu docs.rs handy as references. The GPU programming community is active and welcoming — share your work and learn from others.