Shader Programming with wgpu and WGSL
This document is a self-guided course on GPU shader programming. It is organised into six parts: the GPU execution model, setting up with wgpu, vertex and fragment shaders, textures and samplers, compute shaders, and a look at where to go next. Each section is either a reading lesson or a hands-on Rust programming exercise.
Table of Contents
Part 1 — The GPU and the Graphics Pipeline
- CPU vs GPU: parallel execution model
- The programmable pipeline: vertex, fragment, compute shaders
- What is WGSL? Syntax overview
Part 2 — Setting Up with wgpu
- What is wgpu? Cross-platform graphics API in Rust
- Exercise 1: create a window and clear it to a colour
- The render loop: swap chains, frames, command encoders
Part 3 — Vertex and Fragment Shaders
- Vertices, buffers, and the vertex shader
- Interpolation and the fragment shader
- Exercise 2: draw a coloured triangle
- Exercise 3: animate the triangle using a time uniform
Part 4 — Textures and Samplers
- Texture coordinates (UVs), texture creation, sampler config
- Exercise 4: render a textured quad
Part 5 — Compute Shaders
- Compute pipelines: dispatching work groups
- Storage buffers and read/write access from WGSL
- Exercise 5: GPU-accelerate a particle simulation
Part 6 — Going Further
- Post-processing effects (bloom, blur): conceptual overview
- Signed Distance Fields for font rendering
- Resources: Learn WGPU, Shadertoy, The Book of Shaders
Part 1 — The GPU and the Graphics Pipeline
1. CPU vs GPU: parallel execution model
To understand shader programming, you first need to understand why GPUs exist and how they differ from CPUs. The core difference comes down to a design trade-off: latency vs throughput.
The CPU: a few powerful cores
A modern CPU has a small number of cores — typically 4 to 16 on a consumer chip. Each core is highly sophisticated: it has deep pipelines, branch predictors, out-of-order execution, and large caches. This design makes each individual core extremely fast at executing a single sequence of instructions.
CPU (8 cores)
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ Core 0 │ │ Core 1 │ │ Core 2 │ │ Core 3 │
│ (complex)│ │ (complex)│ │ (complex)│ │ (complex)│
│ OoO exec │ │ OoO exec │ │ OoO exec │ │ OoO exec │
│ Branch │ │ Branch │ │ Branch │ │ Branch │
│ pred. │ │ pred. │ │ pred. │ │ pred. │
│ L1/L2 │ │ L1/L2 │ │ L1/L2 │ │ L1/L2 │
└──────────┘ └──────────┘ └──────────┘ └──────────┘
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ Core 4 │ │ Core 5 │ │ Core 6 │ │ Core 7 │
│ (complex)│ │ (complex)│ │ (complex)│ │ (complex)│
└──────────┘ └──────────┘ └──────────┘ └──────────┘
CPUs are optimised for low latency — finishing any single task as quickly as possible. This makes them ideal for general-purpose programming: parsing JSON, running game logic, managing operating system tasks.
The GPU: thousands of simple cores
A GPU takes the opposite approach. It packs thousands of tiny, simple cores onto a single chip. Each individual core is much less powerful than a CPU core — no branch prediction, no out-of-order execution, minimal cache. But there are so many of them that the total throughput is enormous.
GPU (thousands of cores)
┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
│ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │
├───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┤
│ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │
├───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┤
│ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │
├───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┤
│ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │
├───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┤
│ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │
├───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┤
│ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │
└───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
Each · is a simple core. Thousands execute in parallel.
GPUs are optimised for high throughput — processing millions of similar operations per second. Each individual operation might be slower than on a CPU, but the sheer volume of parallel work makes up for it.
SIMD vs SIMT
You may have heard of SIMD (Single Instruction, Multiple Data) on CPUs — instructions like SSE or AVX that process 4 or 8 values at once in a single register. GPUs take this idea much further with SIMT (Single Instruction, Multiple Threads).
In SIMT, groups of threads (called warps on NVIDIA or wavefronts on AMD) execute the same instruction at the same time, but each thread operates on different data. A typical warp is 32 threads wide.
SIMT execution (one warp of 32 threads):
Instruction: multiply position by matrix
Thread 0: vertex[0].pos * matrix → result[0]
Thread 1: vertex[1].pos * matrix → result[1]
Thread 2: vertex[2].pos * matrix → result[2]
...
Thread 31: vertex[31].pos * matrix → result[31]
All 32 threads execute the same multiply instruction
at the same clock cycle, on different vertex data.
This is why GPUs are perfect for graphics: every pixel on screen needs the same computation (run the fragment shader), just with different input coordinates. The same applies to vertex transformations, physics simulations, and many other tasks.
When does the GPU win?
The GPU excels when your problem has these characteristics:
- Data parallelism: the same operation is applied to many independent data elements
- Arithmetic intensity: lots of math per memory access
- Predictable control flow: minimal branching (if/else) since all threads in a warp must take the same path
Problems that are sequential, branch-heavy, or have complex data dependencies are better left on the CPU.
Key takeaway: CPUs are fast race cars — great at finishing one task quickly. GPUs are cargo ships — slower per trip, but they move enormous amounts of freight in parallel. Shader programming is the art of loading that cargo ship efficiently.
2. The programmable pipeline: vertex, fragment, compute shaders
Modern GPUs run a programmable graphics pipeline — a fixed sequence of stages where some stages run programs you write (shaders) and others are handled automatically by the hardware. Understanding this pipeline is essential before writing any shader code.
The graphics pipeline
When you ask the GPU to draw a triangle, your data flows through several stages:
The Graphics Pipeline
=====================
CPU (your Rust code)
│
│ Vertex data + draw call
▼
┌─────────────────────┐
│ VERTEX SHADER │ ◄── Programmable (you write this)
│ Transforms each │ Runs once per vertex
│ vertex position │
└────────┬────────────┘
│
▼
┌─────────────────────┐
│ PRIMITIVE │ ◄── Fixed-function (hardware)
│ ASSEMBLY │ Connects vertices into
│ │ triangles, lines, or points
└────────┬────────────┘
│
▼
┌─────────────────────┐
│ RASTERISATION │ ◄── Fixed-function (hardware)
│ Determines which │ Converts triangles into
│ pixels a triangle │ fragments (candidate pixels)
│ covers │
└────────┬────────────┘
│
▼
┌─────────────────────┐
│ FRAGMENT SHADER │ ◄── Programmable (you write this)
│ Computes the │ Runs once per fragment
│ colour of each │ (potential pixel)
│ fragment │
└────────┬────────────┘
│
▼
┌─────────────────────┐
│ OUTPUT MERGER │ ◄── Fixed-function (hardware)
│ Depth test, blend │ Combines fragments into
│ with framebuffer │ the final image
└─────────────────────┘
The vertex shader
The vertex shader runs once for every vertex you submit. Its primary job is to transform vertex positions from model space (the coordinates you defined your mesh in) to clip space (the coordinate system the GPU uses to determine what is on screen).
A vertex shader typically receives input data — position, colour, texture coordinates — and outputs a transformed position plus any data that should be passed to the fragment shader.
For example, a vertex shader might:
- Multiply the vertex position by a model-view-projection matrix (sketched after this list)
- Pass the vertex colour through to the next stage
- Compute lighting values at each vertex
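A minimal WGSL sketch of the first case, assuming a uniform mat4x4<f32> named mvp has been bound (uniform bindings are covered in section 3 and Exercise 3):
@group(0) @binding(0)
var<uniform> mvp: mat4x4<f32>;
@vertex
fn vs_main(@location(0) position: vec3<f32>) -> @builtin(position) vec4<f32> {
    // One matrix multiply takes the vertex from model space to clip space.
    return mvp * vec4<f32>(position, 1.0);
}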
Rasterisation
After the vertex shader runs and the GPU assembles vertices into triangles, rasterisation determines which screen pixels each triangle covers. This is not programmable — the hardware handles it automatically.
For each pixel covered by a triangle, the rasteriser generates a fragment. A fragment is a candidate pixel — it carries interpolated values from the triangle’s vertices (we will explore interpolation in detail in section 8).
The fragment shader
The fragment shader runs once for every fragment produced by rasterisation. Its job is to determine the final colour of that pixel. This is where most of the visual magic happens: texturing, lighting, shadows, reflections, and special effects are all implemented in the fragment shader.
The fragment shader receives interpolated data from the vertex shader (like colour or texture coordinates) and outputs a colour value, typically as an RGBA (red, green, blue, alpha) tuple.
Compute shaders: a separate path
Compute shaders do not participate in the graphics pipeline at all. They are general-purpose programs that run on the GPU, independent of any rendering. You dispatch them with explicit work-group sizes and they can read from and write to buffers and textures.
Compute Pipeline (independent of graphics)
==========================================
CPU (your Rust code)
│
│ Dispatch (work group counts)
▼
┌─────────────────────┐
│ COMPUTE SHADER │ ◄── Programmable (you write this)
│ General-purpose │ Runs once per invocation
│ parallel work │ across work groups
└─────────────────────┘
│
▼
Output buffers / textures
Compute shaders are used for physics simulations, image processing, machine learning inference, procedural generation, and any task that benefits from massive parallelism but does not need the rasterisation pipeline.
Key takeaway: The GPU has two paths for running your code. The graphics pipeline flows from vertex shader through rasterisation to fragment shader, producing pixels on screen. The compute pipeline is a separate, general-purpose path for parallel computation. You will write programs for all three shader types in this course.
3. What is WGSL? Syntax overview
WGSL (WebGPU Shading Language) is the shader language used by the WebGPU API — and by extension, by wgpu. If you have used GLSL or HLSL before, WGSL will feel familiar but with a more explicit, Rust-influenced syntax. If you are new to shader languages, this section covers everything you need to get started.
Scalar types
WGSL provides a small set of scalar types:
| Type | Description |
|---|---|
| f32 | 32-bit floating point |
| f16 | 16-bit floating point (optional feature) |
| i32 | 32-bit signed integer |
| u32 | 32-bit unsigned integer |
| bool | Boolean |
Vector types
Vectors are fundamental in shader programming. WGSL supports vectors of 2, 3, or 4 components:
var a: vec2<f32> = vec2<f32>(1.0, 2.0);
var b: vec3<f32> = vec3<f32>(1.0, 0.0, 0.0); // a red colour or a direction
var c: vec4<f32> = vec4<f32>(0.2, 0.4, 0.8, 1.0); // RGBA colour
// Shorthand constructors (type inference):
var d = vec3f(1.0, 0.0, 0.0); // vec3<f32>
var e = vec4f(0.0, 0.0, 0.0, 1.0);
You can access components with swizzling:
var color = vec4f(1.0, 0.5, 0.2, 1.0);
var rgb = color.rgb; // vec3f(1.0, 0.5, 0.2)
var rr = color.xx; // vec2f(1.0, 1.0)
Components can be accessed as x/y/z/w or r/g/b/a — they are interchangeable aliases.
Matrix types
Matrices are used for transformations (rotation, scaling, projection):
// A 4x4 matrix of f32 values (4 columns, 4 rows)
var transform: mat4x4<f32>;
// A 3x3 matrix
var rotation: mat3x3<f32>;
Matrix-vector multiplication uses the * operator: transform * vec4f(pos, 1.0).
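As a sketch of how matrices and vectors combine, here is a hypothetical WGSL helper that builds a Z-axis rotation matrix — note that the matrix constructor takes column vectors:
fn rotation_z(angle: f32) -> mat3x3<f32> {
    let c = cos(angle);
    let s = sin(angle);
    // Each vec3f below is a *column* of the matrix.
    return mat3x3<f32>(
        vec3f(c, s, 0.0),
        vec3f(-s, c, 0.0),
        vec3f(0.0, 0.0, 1.0),
    );
}
A call like rotation_z(0.5) * vec3f(1.0, 0.0, 0.0) then rotates the vector by half a radian.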
Variables: let vs var
// `let` declares an immutable binding (like Rust's `let`)
let pi = 3.14159;
// `var` declares a mutable variable
var counter: u32 = 0u;
counter = counter + 1u;
Structs
Structs group related data, and they are used extensively for shader inputs and outputs:
struct VertexInput {
@location(0) position: vec3<f32>,
@location(1) color: vec3<f32>,
}
struct VertexOutput {
@builtin(position) clip_position: vec4<f32>,
@location(0) color: vec3<f32>,
}
The @location(n) attribute links struct fields to specific slots in the vertex buffer layout or inter-stage communication. The @builtin(position) attribute tells the GPU this field is the clip-space position.
Functions and entry points
WGSL functions look like this:
fn add(a: f32, b: f32) -> f32 {
return a + b;
}
Entry points are functions marked with a stage attribute:
@vertex
fn vs_main(in: VertexInput) -> VertexOutput {
var out: VertexOutput;
out.clip_position = vec4f(in.position, 1.0);
out.color = in.color;
return out;
}
@fragment
fn fs_main(in: VertexOutput) -> @location(0) vec4<f32> {
return vec4f(in.color, 1.0);
}
@compute @workgroup_size(64)
fn cs_main(@builtin(global_invocation_id) id: vec3<u32>) {
// compute work here
}
The @location(0) on the fragment shader return type means “write to the first colour attachment” (the render target).
Built-in attributes
Some commonly used built-in attributes (a usage sketch follows the table):
| Attribute | Stage | Meaning |
|---|---|---|
| @builtin(position) | Vertex out / Fragment in | Clip-space position / fragment coordinates |
| @builtin(vertex_index) | Vertex | Index of the current vertex |
| @builtin(instance_index) | Vertex | Index of the current instance |
| @builtin(global_invocation_id) | Compute | 3D index of this thread in the dispatch |
| @builtin(local_invocation_id) | Compute | 3D index within the work group |
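For example, @builtin(vertex_index) enables the classic trick of drawing a triangle with no vertex buffer at all: the shader indexes a hard-coded array. This sketch is illustrative and not used in the exercises:
@vertex
fn vs_bufferless(@builtin(vertex_index) i: u32) -> @builtin(position) vec4<f32> {
    // Positions live in the shader itself; `var` allows dynamic indexing.
    var positions = array<vec2<f32>, 3>(
        vec2f(0.0, 0.5),
        vec2f(-0.5, -0.5),
        vec2f(0.5, -0.5),
    );
    return vec4f(positions[i], 0.0, 1.0);
}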
Binding resources
Uniforms, storage buffers, textures, and samplers are declared at module scope with @group and @binding attributes:
@group(0) @binding(0)
var<uniform> time: f32;
@group(0) @binding(1)
var texture: texture_2d<f32>;
@group(0) @binding(2)
var tex_sampler: sampler;
The @group(n) corresponds to a bind group index, and @binding(n) is the binding within that group. These must match the bind group layout you define on the Rust side.
Control flow
WGSL supports if/else, for, while, loop, switch, break, continue, and return:
for (var i: u32 = 0u; i < 10u; i = i + 1u) {
if i == 5u {
continue;
}
// do work
}
Key takeaway: WGSL’s syntax is a blend of Rust and C-family languages. Types are explicit, entry points are marked with stage attributes (@vertex, @fragment, @compute), and data flows between stages via structs annotated with @location and @builtin. You will write WGSL for every exercise in this course.
Part 2 — Setting Up with wgpu
4. What is wgpu? Cross-platform graphics API in Rust
wgpu is a Rust crate that implements the WebGPU API specification. It provides a safe, cross-platform interface for GPU programming that works on multiple backends:
| Backend | Platform |
|---|---|
| Vulkan | Linux, Windows, Android |
| Metal | macOS, iOS |
| DX12 | Windows |
| WebGPU | Web browsers (via wasm) |
| OpenGL | Fallback for older systems |
This means you write your GPU code once and it runs everywhere — on desktop, on mobile, and in the browser.
Why not raw Vulkan/Metal/DX12?
Writing directly against a low-level graphics API like Vulkan requires thousands of lines of boilerplate before you can draw a single triangle. Vulkan’s explicit nature gives you maximum control, but the complexity is enormous. wgpu provides a higher-level abstraction that handles the platform differences and much of the boilerplate while still being close enough to the metal for serious work.
Key types in wgpu
Here are the core types you will interact with, in the order you typically create them:
Initialization Flow
====================
Instance
│
│ enumerate adapters
▼
Adapter ←── represents a physical GPU
│
│ request device
▼
Device + Queue
│ │
│ │ submit commands
│ ▼
│ (GPU execution)
│
│ create resources
▼
Buffers, Textures, Pipelines, Bind Groups, ...
- Instance: the entry point to wgpu. Created first, used to find adapters and create surfaces.
- Surface: a handle to a window’s drawable area. Created from a window (provided by a windowing library like winit).
- Adapter: represents a physical GPU. You request one from the instance, optionally specifying preferences (power preference, compatibility with your surface).
- Device: a logical connection to the GPU. You create resources (buffers, textures, pipelines) through the device. Think of it as an open connection to the GPU.
- Queue: used to submit work (command buffers) to the GPU. You get a queue together with the device.
- CommandEncoder: records GPU commands (render passes, compute dispatches, buffer copies) into a command buffer. The command buffer is then submitted to the queue.
- RenderPipeline: describes the full configuration for rendering — which shaders to use, vertex layout, blending mode, pixel format, etc.
- Buffer: a block of GPU-accessible memory. Used for vertex data, index data, uniforms, storage, etc.
- BindGroup: a collection of resources (buffers, textures, samplers) that are made available to shaders. Corresponds to @group(n) in WGSL.
The initialisation sequence in code
Here is a simplified view of wgpu initialisation (we will see the full code in Exercise 1):
#![allow(unused)]
fn main() {
// 1. Create an instance
let instance = wgpu::Instance::new(&wgpu::InstanceDescriptor::default());
// 2. Create a surface from a window
let surface = instance.create_surface(&window)?;
// 3. Request an adapter (physical GPU)
let adapter = instance
.request_adapter(&wgpu::RequestAdapterOptions {
power_preference: wgpu::PowerPreference::default(),
compatible_surface: Some(&surface),
force_fallback_adapter: false,
})
.await
.unwrap();
// 4. Request a device and queue
let (device, queue) = adapter
.request_device(&wgpu::DeviceDescriptor::default(), None)
.await
.unwrap();
// 5. Configure the surface
let config = surface.get_default_config(&adapter, width, height).unwrap();
surface.configure(&device, &config);
}
After this, you are ready to create pipelines, buffers, and start rendering.
Key takeaway: wgpu is a cross-platform GPU abstraction for Rust. You create an Instance, get an Adapter (a physical GPU), open a Device + Queue, and then create resources and submit commands. This same code works on Vulkan, Metal, DX12, and WebGPU.
5. Exercise 1: create a window and clear it to a colour
In this exercise you will create a window using winit, initialise wgpu, and fill the window with a solid colour (cornflower blue). This is the “hello world” of GPU programming.
Step 1: project setup
Create a new Rust project and add the required dependencies to Cargo.toml:
[package]
name = "shader-exercises"
version = "0.1.0"
edition = "2021"
[dependencies]
wgpu = "24"
winit = "30"
pollster = "0.4"
log = "0.4"
env_logger = "0.11"
[profile.release]
opt-level = "z"
lto = true
strip = true
codegen-units = 1
- wgpu: the GPU abstraction layer
- winit: cross-platform window creation and event handling
- pollster: a minimal async executor to block on futures (wgpu uses async for initialisation)
- log: the logging facade that wgpu emits messages through
- env_logger: a log backend, so wgpu can report errors and warnings
Step 2: the complete code
use winit::{
application::ApplicationHandler,
event::WindowEvent,
event_loop::EventLoop,
window::{Window, WindowAttributes},
};
use std::sync::Arc;
/// Holds all wgpu state needed for rendering.
struct GpuState {
surface: wgpu::Surface<'static>,
device: wgpu::Device,
queue: wgpu::Queue,
config: wgpu::SurfaceConfiguration,
}
/// The main application struct.
struct App {
window: Option<Arc<Window>>,
gpu: Option<GpuState>,
}
impl App {
fn new() -> Self {
Self {
window: None,
gpu: None,
}
}
/// Initialise wgpu with the given window.
fn init_gpu(&mut self, window: Arc<Window>) {
let size = window.inner_size();
let instance = wgpu::Instance::new(&wgpu::InstanceDescriptor::default());
let surface = instance.create_surface(window.clone()).unwrap();
let adapter = pollster::block_on(instance.request_adapter(
&wgpu::RequestAdapterOptions {
power_preference: wgpu::PowerPreference::default(),
compatible_surface: Some(&surface),
force_fallback_adapter: false,
},
))
.expect("Failed to find a suitable GPU adapter");
let (device, queue) = pollster::block_on(adapter.request_device(
&wgpu::DeviceDescriptor::default(),
None,
))
.expect("Failed to create device");
let config = surface
.get_default_config(&adapter, size.width.max(1), size.height.max(1))
.expect("Surface is not supported by the adapter");
surface.configure(&device, &config);
self.gpu = Some(GpuState {
surface,
device,
queue,
config,
});
}
/// Render a single frame: clear the screen to cornflower blue.
fn render(&self) {
let gpu = self.gpu.as_ref().unwrap();
// Get the next frame's texture to draw on
let output = gpu.surface.get_current_texture()
.expect("Failed to get surface texture");
let view = output.texture.create_view(&Default::default());
// Create a command encoder to record GPU commands
let mut encoder = gpu.device.create_command_encoder(
&wgpu::CommandEncoderDescriptor {
label: Some("Clear Encoder"),
},
);
// Begin a render pass that clears to cornflower blue
{
let _render_pass = encoder.begin_render_pass(
&wgpu::RenderPassDescriptor {
label: Some("Clear Pass"),
color_attachments: &[Some(
wgpu::RenderPassColorAttachment {
view: &view,
resolve_target: None,
ops: wgpu::Operations {
load: wgpu::LoadOp::Clear(
wgpu::Color {
r: 0.392,
g: 0.584,
b: 0.929,
a: 1.0,
},
),
store: wgpu::StoreOp::Store,
},
},
)],
depth_stencil_attachment: None,
..Default::default()
},
);
// The render pass is dropped here, ending it
}
// Submit the commands to the GPU
gpu.queue.submit(std::iter::once(encoder.finish()));
// Present the frame on screen
output.present();
}
}
impl ApplicationHandler for App {
fn resumed(&mut self, event_loop: &winit::event_loop::ActiveEventLoop) {
if self.window.is_none() {
let attrs = WindowAttributes::default()
.with_title("Exercise 1: Cornflower Blue");
let window = Arc::new(
event_loop.create_window(attrs).unwrap()
);
self.init_gpu(window.clone());
self.window = Some(window);
}
}
fn window_event(
&mut self,
event_loop: &winit::event_loop::ActiveEventLoop,
_window_id: winit::window::WindowId,
event: WindowEvent,
) {
match event {
WindowEvent::CloseRequested => {
event_loop.exit();
}
WindowEvent::Resized(new_size) => {
if let Some(gpu) = &mut self.gpu {
gpu.config.width = new_size.width.max(1);
gpu.config.height = new_size.height.max(1);
gpu.surface.configure(&gpu.device, &gpu.config);
}
}
WindowEvent::RedrawRequested => {
self.render();
if let Some(window) = &self.window {
window.request_redraw();
}
}
_ => {}
}
}
}
fn main() {
env_logger::init();
let event_loop = EventLoop::new().unwrap();
let mut app = App::new();
event_loop.run_app(&mut app).unwrap();
}
Step 3: run it
cargo run
You should see a window filled with cornflower blue (a pleasant mid-blue, rgb(100, 149, 237)). The window responds to resizing and closes when you click the close button.
What just happened?
Let’s break down the key parts:
- Window creation: winit creates a native window. We wrap it in Arc so wgpu can reference it.
- Surface: created from the window — this is where rendered frames go.
- Adapter + Device + Queue: we find a GPU, open a logical device, and get a command queue.
- Surface configuration: tells the surface what pixel format and size to use.
- Render loop: every frame we create a CommandEncoder, begin a RenderPass with a clear colour, end the pass, submit commands, and present.
The clear colour is specified as wgpu::Color { r, g, b, a } with values in the 0.0-1.0 range.
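If you prefer thinking in familiar 0-255 values, a small helper (not part of the exercise code) can do the conversion — wgpu::Color stores each channel as an f64:
#![allow(unused)]
fn main() {
fn color_from_u8(r: u8, g: u8, b: u8) -> wgpu::Color {
    // Map 0-255 integer channels onto the 0.0-1.0 range wgpu expects.
    wgpu::Color {
        r: r as f64 / 255.0,
        g: g as f64 / 255.0,
        b: b as f64 / 255.0,
        a: 1.0,
    }
}
}
color_from_u8(100, 149, 237) reproduces the cornflower blue above (100/255 ≈ 0.392, and so on).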
Try this: change the colour to something else — pure red (1.0, 0.0, 0.0, 1.0), bright green, or your favourite colour. Rebuild and see the change.
6. The render loop: swap chains, frames, command encoders
Now that you have a working window, let’s dive deeper into what happens each frame. Understanding the render loop is crucial because every shader program you write will run inside this cycle.
The frame lifecycle
Every frame follows the same sequence. Here is what happens between one screen update and the next:
Frame Lifecycle
===============
Time ──────────────────────────────────────────────────►
┌──── Frame N ─────────────────────┐ ┌── Frame N+1 ──
│ │ │
│ 1. Acquire 2. Record 3. Submit 4. Present
│ surface commands to to
│ texture (render queue screen
│ pass)
│ │ │
│ CPU CPU │ GPU executes
│ side side │ asynchronously
└──────────────────────────────────┘ └────────────────
Step 1: acquire a surface texture
#![allow(unused)]
fn main() {
let output = surface.get_current_texture()?;
let view = output.texture.create_view(&Default::default());
}
The surface manages a small pool of textures (typically 2-3, called a swap chain). When you call get_current_texture(), you receive the next available texture to draw on. While you are drawing on texture A, the GPU may still be displaying the previous texture B on screen — this is double buffering.
Double Buffering
================
┌────────────┐ ┌────────────┐
│ Texture A │ │ Texture B │
│ (drawing) │ │ (on screen)│
└────────────┘ └────────────┘
▲ ▲
│ │
You render Monitor
into this displays
one now this one
After you present texture A, the roles swap: A goes to the screen and B becomes available for the next frame.
Step 2: record commands with a command encoder
#![allow(unused)]
fn main() {
let mut encoder = device.create_command_encoder(&Default::default());
}
The CommandEncoder is like a tape recorder for GPU commands. You do not execute anything immediately — you record a list of operations, and then submit them all at once. This is called a command buffer model.
Why not execute commands immediately? Because the GPU operates asynchronously. Batching commands into a buffer lets the GPU execute them efficiently without constant back-and-forth with the CPU.
Step 3: begin a render pass
#![allow(unused)]
fn main() {
let render_pass = encoder.begin_render_pass(&wgpu::RenderPassDescriptor {
    color_attachments: &[Some(wgpu::RenderPassColorAttachment {
        view: &view,
        resolve_target: None, // no multisampling, so no resolve texture
        ops: wgpu::Operations {
            load: wgpu::LoadOp::Clear(clear_color),
            store: wgpu::StoreOp::Store,
        },
    })],
    ..Default::default()
});
}
A render pass is a sequence of draw commands that all target the same set of attachments (colour textures, depth buffers). Within a render pass, you:
- Set the pipeline
- Bind vertex buffers and bind groups
- Issue draw calls
The load operation specifies what happens to the attachment at the start of the pass. LoadOp::Clear(color) fills it with a solid colour. LoadOp::Load preserves the previous contents.
The store operation specifies what happens at the end. StoreOp::Store keeps the results; StoreOp::Discard throws them away (useful for depth buffers you do not need after the pass).
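To make the two knobs concrete, here is a sketch of the Operations values you would place in a colour attachment's ops field (wgpu::Color::BLACK is just a convenient constant):
#![allow(unused)]
fn main() {
// Clear to a solid colour at the start, keep the results at the end —
// the common case for the first pass of a frame:
let clear_then_store = wgpu::Operations {
    load: wgpu::LoadOp::Clear(wgpu::Color::BLACK),
    store: wgpu::StoreOp::Store,
};
// Preserve the existing contents instead — useful when a later pass
// draws on top of an earlier pass's output:
let load_then_store: wgpu::Operations<wgpu::Color> = wgpu::Operations {
    load: wgpu::LoadOp::Load,
    store: wgpu::StoreOp::Store,
};
}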
Step 4: submit and present
#![allow(unused)]
fn main() {
// End the render pass (drop it)
drop(render_pass);
// Finish recording and get a command buffer
let command_buffer = encoder.finish();
// Submit the command buffer to the GPU
queue.submit(std::iter::once(command_buffer));
// Show the rendered frame on screen
output.present();
}
queue.submit() sends the command buffer to the GPU for execution. The GPU processes it asynchronously — your CPU code continues immediately. output.present() tells the surface to display this texture once the GPU finishes rendering to it.
Multiple render passes
You can have multiple render passes in a single frame. This is common for:
- Shadow mapping: render the scene from a light’s perspective (pass 1), then render the final image using the shadow map (pass 2)
- Post-processing: render the scene to a texture (pass 1), then apply a blur filter to that texture and draw the result to the screen (pass 2)
Each pass gets its own begin_render_pass / drop cycle within the same command encoder.
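Here is a sketch of two passes recorded into a single encoder, assuming scene_view (a view of an offscreen texture) and surface_view (the window's surface texture view) already exist; the pipelines and draw calls are elided:
#![allow(unused)]
fn main() {
let mut encoder = device.create_command_encoder(&Default::default());
// Pass 1: render the scene into the offscreen texture.
{
    let _pass = encoder.begin_render_pass(&wgpu::RenderPassDescriptor {
        label: Some("Scene Pass"),
        color_attachments: &[Some(wgpu::RenderPassColorAttachment {
            view: &scene_view,
            resolve_target: None,
            ops: wgpu::Operations {
                load: wgpu::LoadOp::Clear(wgpu::Color::BLACK),
                store: wgpu::StoreOp::Store,
            },
        })],
        depth_stencil_attachment: None,
        ..Default::default()
    });
    // ... set the scene pipeline and issue draw calls ...
} // dropping the pass ends it
// Pass 2: sample the offscreen texture and draw the final image.
{
    let _pass = encoder.begin_render_pass(&wgpu::RenderPassDescriptor {
        label: Some("Post Pass"),
        color_attachments: &[Some(wgpu::RenderPassColorAttachment {
            view: &surface_view,
            resolve_target: None,
            ops: wgpu::Operations {
                load: wgpu::LoadOp::Clear(wgpu::Color::BLACK),
                store: wgpu::StoreOp::Store,
            },
        })],
        depth_stencil_attachment: None,
        ..Default::default()
    });
    // ... set the post-processing pipeline and draw a fullscreen quad ...
}
queue.submit(std::iter::once(encoder.finish()));
}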
Key takeaway: each frame, you acquire a surface texture, record GPU commands into a command encoder (including one or more render passes), submit the commands to the queue, and present the result. The CPU and GPU work asynchronously — the CPU records commands for the next frame while the GPU executes the current one.
Part 3 — Vertex and Fragment Shaders
7. Vertices, buffers, and the vertex shader
To draw anything beyond a solid colour, you need to send geometry to the GPU. Geometry is made of vertices — points in space that define the corners of triangles. This section explains how vertex data flows from your Rust code to the vertex shader on the GPU.
What is a vertex?
A vertex is a point with associated data. At minimum, a vertex has a position, but it usually carries additional attributes:
Vertex Data (per vertex)
========================
┌──────────────────────────────────────────────┐
│ position: vec3<f32> (x, y, z) │
│ color: vec3<f32> (r, g, b) │
│ uv: vec2<f32> (texture coordinate) │
│ normal: vec3<f32> (surface direction) │
└──────────────────────────────────────────────┘
For a simple coloured triangle, you might have three vertices with position and colour:
#![allow(unused)]
fn main() {
#[repr(C)]
#[derive(Copy, Clone, bytemuck::Pod, bytemuck::Zeroable)]
struct Vertex {
position: [f32; 3],
color: [f32; 3],
}
const VERTICES: &[Vertex] = &[
Vertex { position: [ 0.0, 0.5, 0.0], color: [1.0, 0.0, 0.0] }, // top, red
Vertex { position: [-0.5, -0.5, 0.0], color: [0.0, 1.0, 0.0] }, // left, green
Vertex { position: [ 0.5, -0.5, 0.0], color: [0.0, 0.0, 1.0] }, // right, blue
];
}
The #[repr(C)] attribute ensures the struct has a predictable memory layout matching what the GPU expects. The bytemuck derives let us safely cast the struct to raw bytes.
Vertex buffers
To get vertex data onto the GPU, you create a vertex buffer:
#![allow(unused)]
fn main() {
use wgpu::util::DeviceExt;
let vertex_buffer = device.create_buffer_init(&wgpu::util::BufferInitDescriptor {
label: Some("Vertex Buffer"),
contents: bytemuck::cast_slice(VERTICES),
usage: wgpu::BufferUsages::VERTEX,
});
}
This copies the vertex data from CPU memory into GPU memory. The VERTEX usage flag tells wgpu that this buffer will be used as a vertex buffer.
Vertex buffer layout
The GPU does not know the structure of your vertex data — you must describe it with a vertex buffer layout:
#![allow(unused)]
fn main() {
let vertex_layout = wgpu::VertexBufferLayout {
array_stride: std::mem::size_of::<Vertex>() as u64,
step_mode: wgpu::VertexStepMode::Vertex,
attributes: &[
// position: 3 floats at offset 0
wgpu::VertexAttribute {
format: wgpu::VertexFormat::Float32x3,
offset: 0,
shader_location: 0,
},
// color: 3 floats at offset 12 bytes (after 3 x f32)
wgpu::VertexAttribute {
format: wgpu::VertexFormat::Float32x3,
offset: 12,
shader_location: 1,
},
],
};
}
This tells the GPU: “each vertex is N bytes apart (array_stride), and within each vertex, location 0 is three floats starting at byte 0, and location 1 is three floats starting at byte 12.”
The shader_location values correspond to @location(n) in your WGSL shader.
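As an aside, wgpu also provides the vertex_attr_array! macro, which computes the offsets for you; the following is equivalent to the hand-written attributes above:
#![allow(unused)]
fn main() {
// location 0 => Float32x3 (position), location 1 => Float32x3 (color);
// the macro fills in the byte offsets automatically.
const ATTRIBUTES: [wgpu::VertexAttribute; 2] =
    wgpu::vertex_attr_array![0 => Float32x3, 1 => Float32x3];
let vertex_layout = wgpu::VertexBufferLayout {
    array_stride: std::mem::size_of::<Vertex>() as u64,
    step_mode: wgpu::VertexStepMode::Vertex,
    attributes: &ATTRIBUTES,
};
}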
How data flows from CPU to vertex shader
CPU Memory GPU Memory Vertex Shader
========== ========== =============
Vertex array copy Vertex buffer read @location(0) position
[pos, color] ──────────► [bytes...] ──────────► @location(1) color
[pos, color] [bytes...]
[pos, color] [bytes...]
The layout descriptor tells the GPU how to
interpret the bytes into typed attributes.
The vertex shader’s job
The vertex shader runs once per vertex. It must output a @builtin(position) value in clip space — a coordinate system where:
- x ranges from -1 (left) to +1 (right)
- y ranges from -1 (bottom) to +1 (top)
- z ranges from 0 (near) to 1 (far)
Anything outside these ranges is clipped (not drawn).
struct VertexInput {
@location(0) position: vec3<f32>,
@location(1) color: vec3<f32>,
}
struct VertexOutput {
@builtin(position) clip_position: vec4<f32>,
@location(0) color: vec3<f32>,
}
@vertex
fn vs_main(in: VertexInput) -> VertexOutput {
var out: VertexOutput;
out.clip_position = vec4f(in.position, 1.0);
out.color = in.color;
return out;
}
In this simple shader, the position passes through unchanged (we are already working in clip space). In real applications, you would multiply by a model-view-projection matrix to transform from 3D world coordinates to clip space.
Key takeaway: vertices carry per-point data (position, colour, etc.) packed into a buffer. The vertex buffer layout tells the GPU how to decode the bytes. The vertex shader transforms each vertex’s position into clip space and passes any additional data (like colour) to the next stage.
8. Interpolation and the fragment shader
After the vertex shader has transformed all vertices and the GPU has assembled them into triangles, rasterisation takes over. This section explains what happens between the vertex shader and the fragment shader — the critical concept of interpolation.
From triangles to pixels
Rasterisation determines which pixels on screen fall inside each triangle. For each pixel inside a triangle, the rasteriser generates a fragment. But what data does each fragment carry?
Consider a triangle with a red vertex, a green vertex, and a blue vertex:
Red (1,0,0)
/\
/ \
/ \
/ what \
/ colour \
/ is this \
/ pixel? \
/______________\
Green Blue
(0,1,0) (0,0,1)
A pixel near the red vertex should be mostly red. A pixel exactly in the centre should be an equal mix of red, green, and blue. The GPU computes this automatically using barycentric interpolation.
Barycentric coordinates
Every point inside a triangle can be expressed as a weighted combination of the three vertices. These weights are called barycentric coordinates (w0, w1, w2), where:
- w0 + w1 + w2 = 1.0
- All weights are between 0 and 1
Barycentric Interpolation
=========================
Point P inside triangle ABC:
P = w0 * A + w1 * B + w2 * C
At vertex A: w0=1, w1=0, w2=0 → colour = A.color
At vertex B: w0=0, w1=1, w2=0 → colour = B.color
At vertex C: w0=0, w1=0, w2=1 → colour = C.color
At centre: w0=⅓, w1=⅓, w2=⅓ → colour = average
The GPU performs this interpolation automatically for every field in the VertexOutput struct (except @builtin(position), which is used for rasterisation itself). This means colours, texture coordinates, normals — everything — gets smoothly interpolated across the triangle surface.
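To make the weighting concrete, here is a small CPU-side sketch of the same computation (illustrative only — the GPU does this in fixed-function hardware):
#![allow(unused)]
fn main() {
/// Blend three vertex colours using barycentric weights (which sum to 1.0).
fn interpolate_color(w: [f32; 3], colors: [[f32; 3]; 3]) -> [f32; 3] {
    let mut out = [0.0_f32; 3];
    for c in 0..3 {
        out[c] = w[0] * colors[0][c] + w[1] * colors[1][c] + w[2] * colors[2][c];
    }
    out
}
// At the centre of a red/green/blue triangle, w = [1/3, 1/3, 1/3] gives
// roughly [0.33, 0.33, 0.33] — the even grey-ish mix described above.
}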
The fragment shader
The fragment shader runs once for each fragment generated by rasterisation. It receives the interpolated values from the vertex shader and outputs a colour:
@fragment
fn fs_main(in: VertexOutput) -> @location(0) vec4<f32> {
return vec4f(in.color, 1.0);
}
In this simple example, the fragment shader just passes through the interpolated colour. But you can do much more:
- Sample a texture using interpolated UV coordinates
- Apply lighting calculations using interpolated normals
- Compute procedural patterns based on the fragment position
- Discard fragments to create transparency cutouts
What the fragment receives
The fragment shader’s input looks like the vertex shader’s output, but the values have been interpolated:
Vertex Shader Output Fragment Shader Input
==================== ====================
Vertex 0: color=(1,0,0)
Vertex 1: color=(0,1,0) ──► Fragment at centre:
Vertex 2: color=(0,0,1) color=(0.33, 0.33, 0.33)
The @builtin(position) field in the fragment shader’s input contains the fragment’s window-space coordinates — (x, y) in pixel coordinates. This can be useful for screen-space effects.
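For instance, a fragment shader could use those pixel coordinates to draw screen-space stripes — a hypothetical sketch, reusing the VertexOutput struct from earlier:
@fragment
fn fs_stripes(in: VertexOutput) -> @location(0) vec4<f32> {
    // clip_position.xy is in pixels here; alternate every 10 pixels.
    let stripe = f32(u32(in.clip_position.x / 10.0) % 2u);
    return vec4f(vec3f(stripe), 1.0);
}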
Visual result
When you draw our red-green-blue triangle, the interpolation produces a smooth colour gradient:
Expected visual output:
┌───────────────────────┐
│ │
│ ▲ Red │
│ ╱ ╲ │
│ ╱ ╲ │
│ ╱ gra ╲ │
│ ╱ dient ╲ │
│ ╱─────────╲ │
│ Green Blue │
│ │
└───────────────────────┘
Colours blend smoothly across the
triangle surface via interpolation.
Key takeaway: the GPU automatically interpolates all vertex shader outputs across the triangle surface using barycentric coordinates. The fragment shader receives these smoothly interpolated values and uses them to compute the final pixel colour. This is why a triangle with three different vertex colours produces a smooth gradient.
9. Exercise 2: draw a coloured triangle
Time to put theory into practice. In this exercise you will extend the Exercise 1 code to draw a coloured triangle with red, green, and blue vertices.
What you will add
- A WGSL shader with vertex and fragment entry points
- A vertex buffer with three coloured vertices
- A render pipeline that connects everything
- A draw call inside the render pass
Step 1: add bytemuck to Cargo.toml
[dependencies]
wgpu = { version = "24", features = ["wgsl"] }
winit = "30"
pollster = "0.4"
log = "0.4"
env_logger = "0.11"
bytemuck = { version = "1", features = ["derive"] }
Step 2: the WGSL shader
Create a file called shader.wgsl in the same directory as main.rs — we will pull its contents into the binary at compile time with include_str!:
// Vertex input: position and colour from the vertex buffer
struct VertexInput {
@location(0) position: vec3<f32>,
@location(1) color: vec3<f32>,
}
// Output from vertex shader, input to fragment shader
struct VertexOutput {
@builtin(position) clip_position: vec4<f32>,
@location(0) color: vec3<f32>,
}
@vertex
fn vs_main(in: VertexInput) -> VertexOutput {
var out: VertexOutput;
out.clip_position = vec4f(in.position, 1.0);
out.color = in.color;
return out;
}
@fragment
fn fs_main(in: VertexOutput) -> @location(0) vec4<f32> {
return vec4f(in.color, 1.0);
}
Step 3: the Rust code
Below is the complete program. It extends Exercise 1 with a vertex buffer, pipeline, and draw call:
use winit::{
application::ApplicationHandler,
event::WindowEvent,
event_loop::EventLoop,
window::{Window, WindowAttributes},
};
use wgpu::util::DeviceExt;
use std::sync::Arc;
// Vertex data structure — must match the shader's VertexInput
#[repr(C)]
#[derive(Copy, Clone, bytemuck::Pod, bytemuck::Zeroable)]
struct Vertex {
position: [f32; 3],
color: [f32; 3],
}
impl Vertex {
/// Describe the memory layout for the GPU.
fn layout() -> wgpu::VertexBufferLayout<'static> {
wgpu::VertexBufferLayout {
array_stride: std::mem::size_of::<Vertex>() as u64,
step_mode: wgpu::VertexStepMode::Vertex,
attributes: &[
wgpu::VertexAttribute {
format: wgpu::VertexFormat::Float32x3,
offset: 0,
shader_location: 0, // @location(0) position
},
wgpu::VertexAttribute {
format: wgpu::VertexFormat::Float32x3,
offset: 12,
shader_location: 1, // @location(1) color
},
],
}
}
}
// Three vertices forming a coloured triangle
const VERTICES: &[Vertex] = &[
Vertex { position: [ 0.0, 0.5, 0.0], color: [1.0, 0.0, 0.0] }, // top — red
Vertex { position: [-0.5, -0.5, 0.0], color: [0.0, 1.0, 0.0] }, // bottom-left — green
Vertex { position: [ 0.5, -0.5, 0.0], color: [0.0, 0.0, 1.0] }, // bottom-right — blue
];
struct GpuState {
surface: wgpu::Surface<'static>,
device: wgpu::Device,
queue: wgpu::Queue,
config: wgpu::SurfaceConfiguration,
pipeline: wgpu::RenderPipeline,
vertex_buffer: wgpu::Buffer,
}
struct App {
window: Option<Arc<Window>>,
gpu: Option<GpuState>,
}
impl App {
fn new() -> Self {
Self { window: None, gpu: None }
}
fn init_gpu(&mut self, window: Arc<Window>) {
let size = window.inner_size();
let instance = wgpu::Instance::new(&Default::default());
let surface = instance.create_surface(window.clone()).unwrap();
let adapter = pollster::block_on(instance.request_adapter(
&wgpu::RequestAdapterOptions {
compatible_surface: Some(&surface),
..Default::default()
},
)).unwrap();
let (device, queue) = pollster::block_on(
adapter.request_device(&Default::default(), None)
).unwrap();
let config = surface
.get_default_config(&adapter, size.width.max(1), size.height.max(1))
.unwrap();
surface.configure(&device, &config);
// Create the shader module from WGSL source
let shader = device.create_shader_module(wgpu::ShaderModuleDescriptor {
label: Some("Triangle Shader"),
source: wgpu::ShaderSource::Wgsl(include_str!("shader.wgsl").into()),
});
// Create the render pipeline
let pipeline_layout = device.create_pipeline_layout(
&wgpu::PipelineLayoutDescriptor {
label: Some("Pipeline Layout"),
bind_group_layouts: &[],
push_constant_ranges: &[],
},
);
let pipeline = device.create_render_pipeline(
&wgpu::RenderPipelineDescriptor {
label: Some("Triangle Pipeline"),
layout: Some(&pipeline_layout),
vertex: wgpu::VertexState {
module: &shader,
entry_point: Some("vs_main"),
buffers: &[Vertex::layout()],
compilation_options: Default::default(),
},
fragment: Some(wgpu::FragmentState {
module: &shader,
entry_point: Some("fs_main"),
targets: &[Some(wgpu::ColorTargetState {
format: config.format,
blend: Some(wgpu::BlendState::REPLACE),
write_mask: wgpu::ColorWrites::ALL,
})],
compilation_options: Default::default(),
}),
primitive: wgpu::PrimitiveState {
topology: wgpu::PrimitiveTopology::TriangleList,
strip_index_format: None,
front_face: wgpu::FrontFace::Ccw,
cull_mode: Some(wgpu::Face::Back),
unclipped_depth: false,
polygon_mode: wgpu::PolygonMode::Fill,
conservative: false,
},
depth_stencil: None,
multisample: wgpu::MultisampleState::default(),
multiview: None,
cache: None,
},
);
// Create the vertex buffer
let vertex_buffer = device.create_buffer_init(
&wgpu::util::BufferInitDescriptor {
label: Some("Vertex Buffer"),
contents: bytemuck::cast_slice(VERTICES),
usage: wgpu::BufferUsages::VERTEX,
},
);
self.gpu = Some(GpuState {
surface, device, queue, config, pipeline, vertex_buffer,
});
}
fn render(&self) {
let gpu = self.gpu.as_ref().unwrap();
let output = gpu.surface.get_current_texture().unwrap();
let view = output.texture.create_view(&Default::default());
let mut encoder = gpu.device.create_command_encoder(&Default::default());
{
let mut pass = encoder.begin_render_pass(&wgpu::RenderPassDescriptor {
label: Some("Triangle Pass"),
color_attachments: &[Some(wgpu::RenderPassColorAttachment {
view: &view,
resolve_target: None,
ops: wgpu::Operations {
load: wgpu::LoadOp::Clear(wgpu::Color {
r: 0.1, g: 0.1, b: 0.1, a: 1.0,
}),
store: wgpu::StoreOp::Store,
},
})],
depth_stencil_attachment: None,
..Default::default()
});
pass.set_pipeline(&gpu.pipeline);
pass.set_vertex_buffer(0, gpu.vertex_buffer.slice(..));
pass.draw(0..3, 0..1); // 3 vertices, 1 instance
}
gpu.queue.submit(std::iter::once(encoder.finish()));
output.present();
}
}
impl ApplicationHandler for App {
fn resumed(&mut self, event_loop: &winit::event_loop::ActiveEventLoop) {
if self.window.is_none() {
let window = Arc::new(
event_loop.create_window(
WindowAttributes::default().with_title("Exercise 2: Coloured Triangle")
).unwrap()
);
self.init_gpu(window.clone());
self.window = Some(window);
}
}
fn window_event(
&mut self,
event_loop: &winit::event_loop::ActiveEventLoop,
_id: winit::window::WindowId,
event: WindowEvent,
) {
match event {
WindowEvent::CloseRequested => event_loop.exit(),
WindowEvent::Resized(size) => {
if let Some(gpu) = &mut self.gpu {
gpu.config.width = size.width.max(1);
gpu.config.height = size.height.max(1);
gpu.surface.configure(&gpu.device, &gpu.config);
}
}
WindowEvent::RedrawRequested => {
self.render();
if let Some(w) = &self.window { w.request_redraw(); }
}
_ => {}
}
}
}
fn main() {
env_logger::init();
let event_loop = EventLoop::new().unwrap();
event_loop.run_app(&mut App::new()).unwrap();
}
Step 4: run and observe
cargo run
You should see a triangle centred in the window with a smooth gradient: red at the top, green at the bottom-left, and blue at the bottom-right. The colours blend smoothly across the surface thanks to the interpolation discussed in section 8.
Key concepts demonstrated
- Vertex struct with #[repr(C)] and bytemuck for safe casting to bytes
- Vertex buffer layout mapping struct fields to @location(n) in the shader
- Shader module loaded from WGSL source via include_str!
- Render pipeline connecting shaders, vertex layout, and output format
- Draw call (pass.draw(0..3, 0..1)) telling the GPU to process 3 vertices as one triangle
Challenge: add three more vertices and draw a second triangle to form a rectangle. You will need 6 vertices total (two triangles of 3 vertices each) and change the draw call to pass.draw(0..6, 0..1).
10. Exercise 3: animate the triangle using a time uniform
Static shapes are nice, but animation is where shaders really shine. In this exercise you will pass an elapsed time value to the vertex shader and use it to rotate the triangle.
New concepts
- Uniform buffers: small, read-only buffers for data that is the same across all vertices/fragments in a draw call (like time, camera matrices, light positions)
- Bind groups: how you connect uniform buffers (and other resources) to shader bindings
- Updating buffers: writing new data to a buffer each frame
Step 1: the updated WGSL shader
struct VertexInput {
@location(0) position: vec3<f32>,
@location(1) color: vec3<f32>,
}
struct VertexOutput {
@builtin(position) clip_position: vec4<f32>,
@location(0) color: vec3<f32>,
}
// A uniform buffer containing the elapsed time
@group(0) @binding(0)
var<uniform> time: f32;
@vertex
fn vs_main(in: VertexInput) -> VertexOutput {
// Rotate the vertex around the Z axis
let angle = time;
let cos_a = cos(angle);
let sin_a = sin(angle);
let rotated = vec3f(
in.position.x * cos_a - in.position.y * sin_a,
in.position.x * sin_a + in.position.y * cos_a,
in.position.z,
);
var out: VertexOutput;
out.clip_position = vec4f(rotated, 1.0);
out.color = in.color;
return out;
}
@fragment
fn fs_main(in: VertexOutput) -> @location(0) vec4<f32> {
return vec4f(in.color, 1.0);
}
The key change is the time uniform and the 2D rotation applied to each vertex. The rotation formulas are:
x' = x * cos(angle) - y * sin(angle)
y' = x * sin(angle) + y * cos(angle)
This rotates the triangle around the origin (centre of clip space) at one radian per second.
Step 2: create the uniform buffer and bind group
On the Rust side, you need to:
- Create a buffer for the time value
- Create a bind group layout describing the binding
- Create a bind group linking the buffer to the layout
- Update the pipeline layout to include the bind group layout
#![allow(unused)]
fn main() {
use std::time::Instant;
// Create the uniform buffer (4 bytes for one f32)
let time_buffer = device.create_buffer(&wgpu::BufferDescriptor {
label: Some("Time Uniform Buffer"),
size: std::mem::size_of::<f32>() as u64,
usage: wgpu::BufferUsages::UNIFORM | wgpu::BufferUsages::COPY_DST,
mapped_at_creation: false,
});
// Create a bind group layout
let bind_group_layout = device.create_bind_group_layout(
&wgpu::BindGroupLayoutDescriptor {
label: Some("Time Bind Group Layout"),
entries: &[wgpu::BindGroupLayoutEntry {
binding: 0,
visibility: wgpu::ShaderStages::VERTEX,
ty: wgpu::BindingType::Buffer {
ty: wgpu::BufferBindingType::Uniform,
has_dynamic_offset: false,
min_binding_size: None,
},
count: None,
}],
},
);
// Create the bind group
let time_bind_group = device.create_bind_group(&wgpu::BindGroupDescriptor {
label: Some("Time Bind Group"),
layout: &bind_group_layout,
entries: &[wgpu::BindGroupEntry {
binding: 0,
resource: time_buffer.as_entire_binding(),
}],
});
// Update the pipeline layout to include our bind group
let pipeline_layout = device.create_pipeline_layout(
&wgpu::PipelineLayoutDescriptor {
label: Some("Animated Pipeline Layout"),
bind_group_layouts: &[&bind_group_layout],
push_constant_ranges: &[],
},
);
}
Step 3: update the buffer each frame
In your render function, before beginning the render pass, write the current time to the buffer:
#![allow(unused)]
fn main() {
let elapsed = self.start_time.elapsed().as_secs_f32();
gpu.queue.write_buffer(&gpu.time_buffer, 0, bytemuck::cast_slice(&[elapsed]));
}
queue.write_buffer copies data from CPU memory into the GPU buffer. This is the simplest way to update a uniform each frame.
Step 4: bind the group in the render pass
Inside your render pass, after setting the pipeline:
#![allow(unused)]
fn main() {
pass.set_pipeline(&gpu.pipeline);
pass.set_bind_group(0, &gpu.time_bind_group, &[]); // group 0
pass.set_vertex_buffer(0, gpu.vertex_buffer.slice(..));
pass.draw(0..3, 0..1);
}
The set_bind_group(0, ...) call makes the time buffer available to the shader as @group(0) @binding(0).
Expected result
When you run the program, you should see the coloured triangle smoothly rotating around the centre of the window. The triangle completes one full rotation every 2π seconds (approximately 6.28 seconds).
Understanding the flow
Each frame:
┌─────────────────────────────────────────────────────────────┐
│ │
│ CPU: elapsed = Instant::now() - start │
│ queue.write_buffer(time_buffer, elapsed) │
│ │
│ GPU: time uniform ← time_buffer │
│ for each vertex: │
│ rotated_pos = rotate(vertex.pos, time) │
│ output clip_position = rotated_pos │
│ │
│ Result: triangle rotates smoothly │
└─────────────────────────────────────────────────────────────┘
Challenge: instead of (or in addition to) rotating, try making the triangle pulse in size using sin(time) as a scale factor. Or make it bounce by adding sin(time) * 0.3 to the y position.
Key takeaway: uniform buffers let you pass per-frame data (time, matrices, parameters) to shaders. You create a buffer, describe its layout in a bind group, bind it during the render pass, and access it in WGSL via @group(n) @binding(n). This is how you make shaders dynamic.
Part 4 — Textures and Samplers
11. Texture coordinates (UVs), texture creation, sampler config
Solid colours and gradients are a start, but most real-world graphics use textures — images mapped onto surfaces. This section explains how textures work, how UV coordinates map image data onto geometry, and how samplers control the lookup.
What are UV coordinates?
UV coordinates (also called texture coordinates) describe where on a texture each vertex should sample from. They range from (0, 0) at the top-left of the texture to (1, 1) at the bottom-right:
Texture UV Space
================
(0,0)────────────────(1,0)
│ │
│ ┌──────────┐ │
│ │ │ │
│ │ image │ │
│ │ data │ │
│ │ │ │
│ └──────────┘ │
│ │
(0,1)────────────────(1,1)
Note: in wgpu/WebGPU, (0,0) is the top-left
and v increases downward.
Each vertex carries a UV coordinate. When the GPU rasterises a triangle, it interpolates these UVs across the surface (just like it interpolates colours). The fragment shader then uses the interpolated UV to look up a colour from the texture.
Quad with UV mapping
====================
Vertex 0: pos=(-0.5, 0.5) uv=(0, 0) ← top-left
Vertex 1: pos=( 0.5, 0.5) uv=(1, 0) ← top-right
Vertex 2: pos=(-0.5,-0.5) uv=(0, 1) ← bottom-left
Vertex 3: pos=( 0.5,-0.5) uv=(1, 1) ← bottom-right
The full texture maps exactly onto the quad.
Creating a texture in wgpu
To use a texture, you need to:
- Create the texture on the GPU
- Upload the image data
- Create a texture view for accessing it in shaders
- Create a sampler that controls how texels are looked up
#![allow(unused)]
fn main() {
// Step 1: Create the texture
let texture = device.create_texture(&wgpu::TextureDescriptor {
label: Some("My Texture"),
size: wgpu::Extent3d {
width: img_width,
height: img_height,
depth_or_array_layers: 1,
},
mip_level_count: 1,
sample_count: 1,
dimension: wgpu::TextureDimension::D2,
format: wgpu::TextureFormat::Rgba8UnormSrgb,
usage: wgpu::TextureUsages::TEXTURE_BINDING | wgpu::TextureUsages::COPY_DST,
view_formats: &[],
});
// Step 2: Upload the pixel data
queue.write_texture(
wgpu::TexelCopyTextureInfo {
texture: &texture,
mip_level: 0,
origin: wgpu::Origin3d::ZERO,
aspect: wgpu::TextureAspect::All,
},
&rgba_bytes, // &[u8] of RGBA pixel data
wgpu::TexelCopyBufferLayout {
offset: 0,
bytes_per_row: Some(4 * img_width),
rows_per_image: Some(img_height),
},
wgpu::Extent3d {
width: img_width,
height: img_height,
depth_or_array_layers: 1,
},
);
// Step 3: Create a view
let texture_view = texture.create_view(&Default::default());
}
Sampler configuration
A sampler controls how the GPU looks up texels (texture pixels) when the UV does not land exactly on a texel centre. There are two key settings:
Filtering controls how texels are blended:
- Nearest: picks the closest texel (pixelated look, fast)
- Linear: blends the four nearest texels (smooth look)
Nearest filtering Linear filtering
================== =================
┌───┬───┬───┐ ┌───┬───┬───┐
│ A │ B │ │ │ A │ B │ │
├───┼───┼───┤ ├───┼╌╌╌┼───┤
│ C │ D │ │ │ C │avg│ │
├───┼───┼───┤ ├───┼───┼───┤
│ │ │ │ │ │ │ │
└───┴───┴───┘ └───┴───┴───┘
Nearest: picks one Linear: blends A,B,C,D
texel (e.g., A) based on distance
Address mode (wrapping) controls what happens when UVs go outside the 0-1 range:
- ClampToEdge: UVs outside 0-1 use the edge colour
- Repeat: the texture tiles
- MirrorRepeat: the texture tiles, flipping every other repetition
#![allow(unused)]
fn main() {
let sampler = device.create_sampler(&wgpu::SamplerDescriptor {
label: Some("Texture Sampler"),
address_mode_u: wgpu::AddressMode::ClampToEdge,
address_mode_v: wgpu::AddressMode::ClampToEdge,
address_mode_w: wgpu::AddressMode::ClampToEdge,
mag_filter: wgpu::FilterMode::Linear,
min_filter: wgpu::FilterMode::Linear,
mipmap_filter: wgpu::FilterMode::Nearest,
..Default::default()
});
}
Bind groups for textures
Textures and samplers are bound to shaders using bind groups, just like uniform buffers:
#![allow(unused)]
fn main() {
let bind_group_layout = device.create_bind_group_layout(
&wgpu::BindGroupLayoutDescriptor {
label: Some("Texture Bind Group Layout"),
entries: &[
// The texture
wgpu::BindGroupLayoutEntry {
binding: 0,
visibility: wgpu::ShaderStages::FRAGMENT,
ty: wgpu::BindingType::Texture {
sample_type: wgpu::TextureSampleType::Float { filterable: true },
view_dimension: wgpu::TextureViewDimension::D2,
multisampled: false,
},
count: None,
},
// The sampler
wgpu::BindGroupLayoutEntry {
binding: 1,
visibility: wgpu::ShaderStages::FRAGMENT,
ty: wgpu::BindingType::Sampler(
wgpu::SamplerBindingType::Filtering,
),
count: None,
},
],
},
);
}
In WGSL, you access them like this:
@group(0) @binding(0)
var t_diffuse: texture_2d<f32>;
@group(0) @binding(1)
var s_diffuse: sampler;
@fragment
fn fs_main(in: VertexOutput) -> @location(0) vec4<f32> {
return textureSample(t_diffuse, s_diffuse, in.uv);
}
The textureSample function takes a texture, a sampler, and UV coordinates, and returns the sampled colour.
Key takeaway: textures are images stored on the GPU. UV coordinates map texture space onto geometry. Samplers control filtering (nearest vs linear) and wrapping behaviour. The fragment shader uses textureSample to look up a colour from the texture at interpolated UV coordinates.
12. Exercise 4: render a textured quad
In this exercise you will draw a rectangle (two triangles forming a quad) with a texture mapped onto it. You will create a procedural checkerboard texture in code rather than loading an image file, keeping the exercise self-contained.
Step 1: add dependencies
We do not need an image loading crate for this exercise since we generate the texture procedurally. The same Cargo.toml from Exercise 2 works, with bytemuck already included.
Step 2: the WGSL shader
struct VertexInput {
@location(0) position: vec3<f32>,
@location(1) uv: vec2<f32>,
}
struct VertexOutput {
@builtin(position) clip_position: vec4<f32>,
@location(0) uv: vec2<f32>,
}
@vertex
fn vs_main(in: VertexInput) -> VertexOutput {
var out: VertexOutput;
out.clip_position = vec4f(in.position, 1.0);
out.uv = in.uv;
return out;
}
@group(0) @binding(0)
var t_texture: texture_2d<f32>;
@group(0) @binding(1)
var s_sampler: sampler;
@fragment
fn fs_main(in: VertexOutput) -> @location(0) vec4<f32> {
return textureSample(t_texture, s_sampler, in.uv);
}
Note how the vertex now carries a vec2<f32> UV coordinate instead of a colour. The fragment shader samples the texture at the interpolated UV.
Step 3: vertex data for a quad
A quad is two triangles. We define six vertices (or four vertices with an index buffer — we will use six for simplicity):
#![allow(unused)]
fn main() {
#[repr(C)]
#[derive(Copy, Clone, bytemuck::Pod, bytemuck::Zeroable)]
struct Vertex {
position: [f32; 3],
uv: [f32; 2],
}
// Two triangles forming a quad
const VERTICES: &[Vertex] = &[
// Triangle 1 (top-left half)
Vertex { position: [-0.5, 0.5, 0.0], uv: [0.0, 0.0] }, // top-left
Vertex { position: [-0.5, -0.5, 0.0], uv: [0.0, 1.0] }, // bottom-left
Vertex { position: [ 0.5, 0.5, 0.0], uv: [1.0, 0.0] }, // top-right
// Triangle 2 (bottom-right half)
Vertex { position: [ 0.5, 0.5, 0.0], uv: [1.0, 0.0] }, // top-right
Vertex { position: [-0.5, -0.5, 0.0], uv: [0.0, 1.0] }, // bottom-left
Vertex { position: [ 0.5, -0.5, 0.0], uv: [1.0, 1.0] }, // bottom-right
];
}
Step 4: generate a procedural checkerboard texture
#![allow(unused)]
fn main() {
/// Generate a `width` x `height` RGBA checkerboard with square cells
/// of `cell_size` pixels.
fn make_checkerboard(width: u32, height: u32, cell_size: u32) -> Vec<u8> {
let mut pixels = Vec::with_capacity((width * height * 4) as usize);
for y in 0..height {
for x in 0..width {
let is_white = ((x / cell_size) + (y / cell_size)) % 2 == 0;
let val = if is_white { 255u8 } else { 80u8 };
pixels.push(val); // R
pixels.push(val); // G
pixels.push(val); // B
pixels.push(255); // A
}
}
pixels
}
}
Call it with make_checkerboard(256, 256, 32) to get a 256x256 texture with 32-pixel checker cells.
Step 5: create the texture, view, and sampler
#![allow(unused)]
fn main() {
let tex_size = 256u32;
let tex_data = make_checkerboard(tex_size, tex_size, 32);
let texture = device.create_texture(&wgpu::TextureDescriptor {
label: Some("Checkerboard Texture"),
size: wgpu::Extent3d {
width: tex_size,
height: tex_size,
depth_or_array_layers: 1,
},
mip_level_count: 1,
sample_count: 1,
dimension: wgpu::TextureDimension::D2,
format: wgpu::TextureFormat::Rgba8UnormSrgb,
usage: wgpu::TextureUsages::TEXTURE_BINDING | wgpu::TextureUsages::COPY_DST,
view_formats: &[],
});
queue.write_texture(
wgpu::TexelCopyTextureInfo {
texture: &texture,
mip_level: 0,
origin: wgpu::Origin3d::ZERO,
aspect: wgpu::TextureAspect::All,
},
&tex_data,
wgpu::TexelCopyBufferLayout {
offset: 0,
bytes_per_row: Some(4 * tex_size),
rows_per_image: Some(tex_size),
},
wgpu::Extent3d {
width: tex_size,
height: tex_size,
depth_or_array_layers: 1,
},
);
let texture_view = texture.create_view(&Default::default());
let sampler = device.create_sampler(&wgpu::SamplerDescriptor {
label: Some("Checkerboard Sampler"),
mag_filter: wgpu::FilterMode::Nearest, // crisp pixels for checkerboard
min_filter: wgpu::FilterMode::Nearest,
..Default::default()
});
}
Step 6: bind group setup
#![allow(unused)]
fn main() {
let bind_group_layout = device.create_bind_group_layout(
&wgpu::BindGroupLayoutDescriptor {
label: Some("Texture Bind Group Layout"),
entries: &[
wgpu::BindGroupLayoutEntry {
binding: 0,
visibility: wgpu::ShaderStages::FRAGMENT,
ty: wgpu::BindingType::Texture {
sample_type: wgpu::TextureSampleType::Float { filterable: true },
view_dimension: wgpu::TextureViewDimension::D2,
multisampled: false,
},
count: None,
},
wgpu::BindGroupLayoutEntry {
binding: 1,
visibility: wgpu::ShaderStages::FRAGMENT,
ty: wgpu::BindingType::Sampler(wgpu::SamplerBindingType::Filtering),
count: None,
},
],
},
);
let bind_group = device.create_bind_group(&wgpu::BindGroupDescriptor {
label: Some("Texture Bind Group"),
layout: &bind_group_layout,
entries: &[
wgpu::BindGroupEntry {
binding: 0,
resource: wgpu::BindingResource::TextureView(&texture_view),
},
wgpu::BindGroupEntry {
binding: 1,
resource: wgpu::BindingResource::Sampler(&sampler),
},
],
});
}
Remember to include &bind_group_layout in your pipeline layout’s bind_group_layouts array, and update the vertex buffer layout to match the new Vertex struct (position: Float32x3 at offset 0, uv: Float32x2 at offset 12).
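For reference, a minimal sketch of that layout; the offsets and formats follow directly from the Vertex struct, and the shader locations match @location(0) and @location(1) in the WGSL above:
#![allow(unused)]
fn main() {
let vertex_layout = wgpu::VertexBufferLayout {
    array_stride: std::mem::size_of::<Vertex>() as wgpu::BufferAddress, // 20 bytes
    step_mode: wgpu::VertexStepMode::Vertex,
    attributes: &[
        wgpu::VertexAttribute {
            offset: 0,
            shader_location: 0, // @location(0) position
            format: wgpu::VertexFormat::Float32x3,
        },
        wgpu::VertexAttribute {
            offset: 12, // three f32s of position = 12 bytes
            shader_location: 1, // @location(1) uv
            format: wgpu::VertexFormat::Float32x2,
        },
    ],
};
}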
Step 7: draw the quad
In your render pass:
#![allow(unused)]
fn main() {
pass.set_pipeline(&gpu.pipeline);
pass.set_bind_group(0, &gpu.bind_group, &[]);
pass.set_vertex_buffer(0, gpu.vertex_buffer.slice(..));
pass.draw(0..6, 0..1); // 6 vertices = 2 triangles = 1 quad
}
Expected result
You should see a rectangle in the centre of the window showing a black-and-white checkerboard pattern. The texture is mapped so that the full checkerboard fills the quad exactly.
Challenge: try changing the sampler’s mag_filter from Nearest to Linear and see how the checkerboard edges become blurred when the quad is large. Then try setting address_mode_u and address_mode_v to Repeat, and change the UVs to go from 0 to 3 — you will see the checkerboard tile three times across the quad.
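If you try the tiling variant, the sampler change is a small tweak to Step 5; a sketch using wgpu's standard SamplerDescriptor fields:
#![allow(unused)]
fn main() {
let sampler = device.create_sampler(&wgpu::SamplerDescriptor {
    label: Some("Tiling Sampler"),
    address_mode_u: wgpu::AddressMode::Repeat, // tile horizontally
    address_mode_v: wgpu::AddressMode::Repeat, // tile vertically
    mag_filter: wgpu::FilterMode::Linear,      // soft edges when magnified
    min_filter: wgpu::FilterMode::Linear,
    ..Default::default()
});
}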
Key takeaway: texturing involves creating a texture from pixel data, configuring a sampler for filtering and wrapping, binding both via a bind group, and sampling in the fragment shader using interpolated UV coordinates. This same pattern applies whether your texture is a checkerboard, a photograph, or a render target from a previous pass.
Part 5 — Compute Shaders
13. Compute pipelines: dispatching work groups
Compute shaders break free from the graphics pipeline entirely. There are no vertices, no triangles, no pixels — just raw parallel computation. This makes them ideal for physics simulations, image processing, data transformations, and any task that benefits from GPU parallelism.
Graphics pipeline vs compute pipeline
Graphics Pipeline Compute Pipeline
================= ================
Vertices Dispatch(x, y, z)
│ │
▼ ▼
Vertex Shader Compute Shader
│ │
▼ ▼
Rasterisation Storage buffers /
│ textures (output)
▼
Fragment Shader
│
▼
Framebuffer (pixels)
Produces images Produces data
With compute shaders, you do not set up vertex buffers, render passes, or colour attachments. Instead, you dispatch work and let the compute shader read/write storage buffers or textures directly.
Work groups and invocations
When you dispatch a compute shader, you specify a 3D grid of work groups. Each work group contains a fixed number of invocations (threads), defined by @workgroup_size in the shader.
Dispatch and Work Groups
========================
dispatch(4, 3, 1) ← 4 x 3 x 1 = 12 work groups
┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐
│ WG │ │ WG │ │ WG │ │ WG │ row 0
│(0,0)│ │(1,0)│ │(2,0)│ │(3,0)│
└─────┘ └─────┘ └─────┘ └─────┘
┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐
│ WG │ │ WG │ │ WG │ │ WG │ row 1
│(0,1)│ │(1,1)│ │(2,1)│ │(3,1)│
└─────┘ └─────┘ └─────┘ └─────┘
┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐
│ WG │ │ WG │ │ WG │ │ WG │ row 2
│(0,2)│ │(1,2)│ │(2,2)│ │(3,2)│
└─────┘ └─────┘ └─────┘ └─────┘
Inside each work group (e.g., @workgroup_size(8, 8, 1)):
┌─┬─┬─┬─┬─┬─┬─┬─┐
│·│·│·│·│·│·│·│·│ 8 invocations wide
├─┼─┼─┼─┼─┼─┼─┼─┤
│·│·│·│·│·│·│·│·│ x 8 invocations tall
├─┼─┼─┼─┼─┼─┼─┼─┤
│·│·│·│·│·│·│·│·│ = 64 invocations per
├─┼─┼─┼─┼─┼─┼─┼─┤ work group
│·│·│·│·│·│·│·│·│
├─┼─┼─┼─┼─┼─┼─┼─┤
│·│·│·│·│·│·│·│·│
├─┼─┼─┼─┼─┼─┼─┼─┤
│·│·│·│·│·│·│·│·│
├─┼─┼─┼─┼─┼─┼─┼─┤
│·│·│·│·│·│·│·│·│
├─┼─┼─┼─┼─┼─┼─┼─┤
│·│·│·│·│·│·│·│·│
└─┴─┴─┴─┴─┴─┴─┴─┘
Total invocations = 12 work groups x 64 = 768 threads
Built-in IDs
Each invocation knows its position in the grid via built-in variables:
| Built-in | Type | Meaning |
|---|---|---|
| global_invocation_id | vec3<u32> | Unique ID across the entire dispatch |
| local_invocation_id | vec3<u32> | ID within the work group (0 to workgroup_size-1) |
| workgroup_id | vec3<u32> | Which work group this invocation belongs to |
| num_workgroups | vec3<u32> | Total number of work groups dispatched |
global_invocation_id is the most commonly used — it gives each thread a unique index.
@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) id: vec3<u32>) {
let index = id.x;
// Process element at `index`
}
Choosing workgroup_size
The @workgroup_size(x, y, z) declaration sets how many invocations run per work group. Guidelines:
- Total invocations per group (x * y * z) should be a multiple of 32 or 64 for best performance (matching GPU warp/wavefront size)
- Common choices: @workgroup_size(64), @workgroup_size(256), @workgroup_size(8, 8) for 2D, @workgroup_size(4, 4, 4) for 3D
- The maximum total varies by GPU but is typically 256 or 1024
Creating a compute pipeline in Rust
#![allow(unused)]
fn main() {
let compute_shader = device.create_shader_module(wgpu::ShaderModuleDescriptor {
label: Some("Compute Shader"),
source: wgpu::ShaderSource::Wgsl(shader_source.into()),
});
let compute_pipeline = device.create_compute_pipeline(
&wgpu::ComputePipelineDescriptor {
label: Some("Compute Pipeline"),
layout: Some(&pipeline_layout),
module: &compute_shader,
entry_point: Some("main"),
compilation_options: Default::default(),
cache: None,
},
);
}
Dispatching work
Instead of a render pass, you use a compute pass:
#![allow(unused)]
fn main() {
let mut encoder = device.create_command_encoder(&Default::default());
{
let mut compute_pass = encoder.begin_compute_pass(&Default::default());
compute_pass.set_pipeline(&compute_pipeline);
compute_pass.set_bind_group(0, &bind_group, &[]);
compute_pass.dispatch_workgroups(num_groups_x, num_groups_y, num_groups_z);
}
queue.submit(std::iter::once(encoder.finish()));
}
If you have 1024 elements and your workgroup_size is 64, you dispatch 1024 / 64 = 16 work groups: dispatch_workgroups(16, 1, 1).
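The same round-up logic generalises to 2D. A sketch for image-sized work with @workgroup_size(8, 8), assuming a hypothetical 1920x1080 image:
#![allow(unused)]
fn main() {
// One invocation per pixel; round each dimension up so edge pixels
// are covered (the shader should bounds-check against the real size).
let (width, height) = (1920u32, 1080u32); // hypothetical image size
compute_pass.dispatch_workgroups(width.div_ceil(8), height.div_ceil(8), 1);
}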
Key takeaway: compute shaders run outside the graphics pipeline. You dispatch a 3D grid of work groups, each containing a fixed number of invocations. Every invocation gets a unique global_invocation_id to determine which data element to process. This is how you harness the GPU’s parallelism for general-purpose computation.
14. Storage buffers and read/write access from WGSL
Compute shaders need to read input data and write output data. Storage buffers are the primary mechanism for this. Unlike uniform buffers (which are small and read-only), storage buffers can be large and support both reading and writing.
Storage buffers vs uniform buffers
| Feature | Uniform Buffer | Storage Buffer |
|---|---|---|
| Max size | ~64 KB (varies) | Hundreds of MB |
| Access | Read-only | Read-only or read-write |
| Speed | Faster (cached aggressively) | Slightly slower |
| Use case | Small, per-frame constants | Large data arrays |
Use uniform buffers for things like transformation matrices, time values, and camera parameters. Use storage buffers for arrays of particles, pixels, mesh data, or any large dataset.
Declaring storage buffers in WGSL
// Read-only storage buffer
@group(0) @binding(0)
var<storage, read> input: array<f32>;
// Read-write storage buffer
@group(0) @binding(1)
var<storage, read_write> output: array<f32>;
You can also use structs:
struct Particle {
position: vec2<f32>,
velocity: vec2<f32>,
}
@group(0) @binding(0)
var<storage, read_write> particles: array<Particle>;
Accessing storage buffer data
Storage buffers behave like regular arrays in WGSL:
// The timestep comes from a uniform buffer (declared here so the
// snippet is complete):
@group(0) @binding(1)
var<uniform> delta_time: f32;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) id: vec3<u32>) {
    let i = id.x;
    // Bounds check — important when dispatch size
    // does not evenly divide the data
    if i >= arrayLength(&particles) {
        return;
    }
    // Read
    let pos = particles[i].position;
    let vel = particles[i].velocity;
    // Compute
    let new_pos = pos + vel * delta_time;
    // Write back
    particles[i].position = new_pos;
}
The arrayLength(&buffer) function returns the number of elements in a runtime-sized array. Always use it for bounds checking — if your dispatch creates more invocations than data elements, the extra threads must bail out early.
Creating storage buffers in Rust
#![allow(unused)]
fn main() {
use wgpu::util::DeviceExt; // brings create_buffer_init into scope

// Create a storage buffer from initial data
let storage_buffer = device.create_buffer_init(&wgpu::util::BufferInitDescriptor {
label: Some("Particle Buffer"),
contents: bytemuck::cast_slice(&initial_particles),
usage: wgpu::BufferUsages::STORAGE
| wgpu::BufferUsages::COPY_SRC // to read back to CPU
| wgpu::BufferUsages::COPY_DST, // to write from CPU
});
}
The STORAGE usage flag is required. Add COPY_SRC if you want to read data back to the CPU, and COPY_DST if you want to upload data from the CPU.
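For example, re-uploading data from the CPU (the COPY_DST path) is a one-liner, assuming new_particles is a slice of Pod values:
#![allow(unused)]
fn main() {
// Overwrite the buffer contents from the CPU (requires COPY_DST).
queue.write_buffer(&storage_buffer, 0, bytemuck::cast_slice(&new_particles));
}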
Bind group layout for storage buffers
#![allow(unused)]
fn main() {
wgpu::BindGroupLayoutEntry {
binding: 0,
visibility: wgpu::ShaderStages::COMPUTE,
ty: wgpu::BindingType::Buffer {
ty: wgpu::BufferBindingType::Storage {
read_only: false, // true for read-only access
},
has_dynamic_offset: false,
min_binding_size: None,
},
count: None,
}
}
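The matching bind group entry is simpler than the layout; a minimal sketch:
#![allow(unused)]
fn main() {
wgpu::BindGroupEntry {
    binding: 0,
    // Bind the whole buffer; use BindingResource::Buffer for a sub-range.
    resource: storage_buffer.as_entire_binding(),
}
}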
Reading results back to the CPU
GPU buffers are not directly accessible from CPU memory. To read results back, you copy to a staging buffer with MAP_READ usage:
#![allow(unused)]
fn main() {
// Create a staging buffer
let staging_buffer = device.create_buffer(&wgpu::BufferDescriptor {
label: Some("Staging Buffer"),
size: storage_buffer.size(),
usage: wgpu::BufferUsages::MAP_READ | wgpu::BufferUsages::COPY_DST,
mapped_at_creation: false,
});
// Copy from storage to staging
encoder.copy_buffer_to_buffer(
&storage_buffer, 0,
&staging_buffer, 0,
storage_buffer.size(),
);
queue.submit(std::iter::once(encoder.finish()));
// Map the staging buffer and read the data
let slice = staging_buffer.slice(..);
slice.map_async(wgpu::MapMode::Read, |_| {});
device.poll(wgpu::Maintain::Wait);
let data = slice.get_mapped_range();
let result: &[Particle] = bytemuck::cast_slice(&data);
// Use the result...
drop(data);
staging_buffer.unmap();
}
Memory considerations
- Workgroup memory: WGSL also supports var<workgroup> for shared memory within a work group. This is very fast but limited in size (typically 16-48 KB).
- Synchronization: within a work group, use workgroupBarrier() to ensure all threads have finished writing before any thread reads shared data. Across work groups, there is no synchronization within a single dispatch — use separate dispatches if you need global barriers.
var<workgroup> shared_data: array<f32, 64>;
@compute @workgroup_size(64)
fn main(@builtin(local_invocation_id) lid: vec3<u32>) {
shared_data[lid.x] = some_computation();
workgroupBarrier(); // wait for all threads in this group
let neighbour = shared_data[(lid.x + 1u) % 64u];
}
Key takeaway: storage buffers are the workhorse of compute shaders — they hold large arrays that shaders can read and write. Declare them with var<storage, read_write> in WGSL, create them with BufferUsages::STORAGE in Rust, and always bounds-check with arrayLength. To read results back to the CPU, copy to a staging buffer with MAP_READ.
15. Exercise 5: GPU-accelerate a particle simulation
In this exercise you will build a simple particle system where thousands of particles are updated each frame by a compute shader. Particles will have positions and velocities, bounce off the edges of the screen, and be rendered as points.
Overview
The architecture is:
┌──────────────┐ ┌──────────────────┐ ┌─────────────┐
│ CPU: init │────►│ GPU: compute pass │────►│ GPU: render │
│ particles │ │ update positions │ │ pass: draw │
│ once │ │ each frame │ │ as points │
└──────────────┘ └──────────────────┘ └─────────────┘
│
▼
                 Storage buffer
                 (read/write by
                 compute shader,
                 read as storage
                 by the render
                 vertex shader)
The same buffer serves double duty: the compute shader writes updated positions into it, and the render pass’s vertex shader reads it back as a read-only storage buffer, indexed by the vertex index.
Step 1: particle data structure
#![allow(unused)]
fn main() {
#[repr(C)]
#[derive(Copy, Clone, bytemuck::Pod, bytemuck::Zeroable)]
struct Particle {
position: [f32; 2],
velocity: [f32; 2],
}
}
Step 2: initialise particles
#![allow(unused)]
fn main() {
use rand::Rng;
fn create_particles(count: usize) -> Vec<Particle> {
let mut rng = rand::rng();
(0..count)
.map(|_| Particle {
position: [
rng.random_range(-1.0f32..1.0),
rng.random_range(-1.0f32..1.0),
],
velocity: [
rng.random_range(-0.5f32..0.5),
rng.random_range(-0.5f32..0.5),
],
})
.collect()
}
}
Add rand = "0.9" to your Cargo.toml.
Step 3: the compute shader (WGSL)
struct Particle {
position: vec2<f32>,
velocity: vec2<f32>,
}
@group(0) @binding(0)
var<storage, read_write> particles: array<Particle>;
@group(0) @binding(1)
var<uniform> delta_time: f32;
@compute @workgroup_size(64)
fn cs_main(@builtin(global_invocation_id) id: vec3<u32>) {
let i = id.x;
if i >= arrayLength(&particles) {
return;
}
var p = particles[i];
// Update position
p.position = p.position + p.velocity * delta_time;
// Bounce off edges
if p.position.x < -1.0 || p.position.x > 1.0 {
p.velocity.x = -p.velocity.x;
p.position.x = clamp(p.position.x, -1.0, 1.0);
}
if p.position.y < -1.0 || p.position.y > 1.0 {
p.velocity.y = -p.velocity.y;
p.position.y = clamp(p.position.y, -1.0, 1.0);
}
particles[i] = p;
}
Step 4: the render shader (WGSL)
To render particles as points, the vertex shader reads the position from the storage buffer. Each particle becomes one point:
struct RenderOutput {
    @builtin(position) pos: vec4<f32>,
}

// We read the same particle buffer as a storage buffer for rendering.
// (The Particle struct must be declared in this module as well.)
@group(0) @binding(0)
var<storage, read> render_particles: array<Particle>;

@vertex
fn vs_render(@builtin(vertex_index) vi: u32) -> RenderOutput {
    var out: RenderOutput;
    let p = render_particles[vi];
    out.pos = vec4f(p.position, 0.0, 1.0);
    return out;
}
@fragment
fn fs_render() -> @location(0) vec4<f32> {
return vec4f(0.2, 0.8, 0.4, 1.0); // green particles
}
Note: WGSL has no point-size builtin, so with wgpu::PrimitiveTopology::PointList (set in the render pipeline’s primitive state) each particle is drawn as a single pixel. To render larger particles, draw a small quad per particle using instancing.
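On the Rust side, the instanced alternative changes only the draw call; the vertex shader would then build a unit quad per particle from @builtin(instance_index). A sketch:
#![allow(unused)]
fn main() {
// Hypothetical instanced draw: six quad vertices, one instance per particle.
rpass.draw(0..6, 0..num_particles);
}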
Step 5: buffer creation
#![allow(unused)]
fn main() {
let num_particles = 10_000u32;
let particles = create_particles(num_particles as usize);
let particle_buffer = device.create_buffer_init(&wgpu::util::BufferInitDescriptor {
label: Some("Particle Buffer"),
contents: bytemuck::cast_slice(&particles),
usage: wgpu::BufferUsages::STORAGE | wgpu::BufferUsages::VERTEX,
});
let dt_buffer = device.create_buffer(&wgpu::BufferDescriptor {
label: Some("Delta Time Buffer"),
size: 4,
usage: wgpu::BufferUsages::UNIFORM | wgpu::BufferUsages::COPY_DST,
mapped_at_creation: false,
});
}
The particle buffer’s STORAGE usage covers both passes here: the compute pipeline binds it read-write, and the render pipeline’s vertex shader binds the same buffer as read-only storage. The VERTEX flag would only be needed if you instead bound it with set_vertex_buffer and a vertex buffer layout.
Step 6: frame loop
Each frame:
- Calculate delta time
- Write delta time to the uniform buffer
- Run the compute pass to update particles
- Run the render pass to draw particles
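Steps 1 and 2 are plain CPU work; a minimal sketch, assuming a last_frame: std::time::Instant field kept between frames:
#![allow(unused)]
fn main() {
// Steps 1 and 2: measure delta time and upload it to the uniform buffer.
let now = std::time::Instant::now();
let dt = (now - last_frame).as_secs_f32();
last_frame = now;
queue.write_buffer(&dt_buffer, 0, bytemuck::bytes_of(&dt));
}
The compute and render passes (steps 3 and 4) then look like this: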
#![allow(unused)]
fn main() {
// Compute pass
{
let mut cpass = encoder.begin_compute_pass(&Default::default());
cpass.set_pipeline(&compute_pipeline);
cpass.set_bind_group(0, &compute_bind_group, &[]);
let num_workgroups = (num_particles + 63) / 64; // round up
cpass.dispatch_workgroups(num_workgroups, 1, 1);
}
// Render pass
{
let mut rpass = encoder.begin_render_pass(&wgpu::RenderPassDescriptor {
color_attachments: &[Some(wgpu::RenderPassColorAttachment {
view: &view,
resolve_target: None,
ops: wgpu::Operations {
load: wgpu::LoadOp::Clear(wgpu::Color::BLACK),
store: wgpu::StoreOp::Store,
},
})],
..Default::default()
});
rpass.set_pipeline(&render_pipeline);
rpass.set_bind_group(0, &render_bind_group, &[]);
rpass.draw(0..num_particles, 0..1);
}
}
Note how dispatch_workgroups rounds up: (10000 + 63) / 64 = 157 work groups, giving 10048 invocations. The bounds check in the shader (if i >= arrayLength(&particles)) prevents the extra 48 threads from accessing out-of-bounds memory.
Expected result
You should see thousands of small green particles bouncing around the window, all updated in parallel on the GPU. With 10,000 particles at 60 FPS, the GPU handles 600,000 particle updates per second with ease — and it could handle millions.
Challenge: add a gravity force (p.velocity.y -= 9.8 * delta_time) and watch the particles fall and bounce off the bottom edge. Or add mouse interaction — pass the mouse position as a uniform and apply a force toward or away from the cursor.
Key takeaway: compute shaders can update large datasets in parallel every frame. By sharing one STORAGE buffer between the compute pipeline (read-write) and the render pipeline (read-only), you can update data in a compute pass and render it in a render pass without copying between buffers. This compute-then-render pattern is the foundation of GPU-driven simulations.
Part 6 — Going Further
16. Post-processing effects (bloom, blur): conceptual overview
So far, you have rendered directly to the screen. But many visual effects require multi-pass rendering: render the scene to an intermediate texture first, then process that texture in subsequent passes before displaying the final result. This is called post-processing.
Render-to-texture
Instead of targeting the swap chain texture directly, you create an off-screen texture and render to it:
Render-to-Texture
==================
Pass 1: Render scene Pass 2: Post-process
┌──────────────────┐ ┌──────────────────┐
│ │ │ │
│ Scene geometry │──render to──► │ Full-screen │──render to──► Screen
│ (3D objects) │ off-screen │ quad sampling │ swap chain
│ │ texture │ the texture │ texture
└──────────────────┘ └──────────────────┘
In wgpu, this means creating a wgpu::Texture with RENDER_ATTACHMENT | TEXTURE_BINDING usage. You render to it in pass 1, then sample from it in pass 2.
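A sketch of creating such a texture, with a hypothetical fixed size and a format assumed to match your render pipeline:
#![allow(unused)]
fn main() {
let scene_target = device.create_texture(&wgpu::TextureDescriptor {
    label: Some("Scene Target"),
    size: wgpu::Extent3d {
        width: 1280, // hypothetical; usually the window's current size
        height: 720,
        depth_or_array_layers: 1,
    },
    mip_level_count: 1,
    sample_count: 1,
    dimension: wgpu::TextureDimension::D2,
    format: wgpu::TextureFormat::Rgba8UnormSrgb, // assumed; match your pipeline
    usage: wgpu::TextureUsages::RENDER_ATTACHMENT // pass 1 renders into it
        | wgpu::TextureUsages::TEXTURE_BINDING,   // pass 2 samples it
    view_formats: &[],
});
}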
Bloom effect
Bloom makes bright areas of an image glow, simulating how real cameras and eyes perceive very bright light. The algorithm has three stages:
Bloom Pipeline
==============
Scene ──► [1. Threshold] ──► [2. Blur] ──► [3. Composite] ──► Final
Extract Gaussian Add blurred
bright blur the bright areas
pixels result back onto
only the original
Stage 1 — Threshold: a fragment shader that outputs only pixels brighter than a threshold, and black for everything else.
@fragment
fn threshold(in: FullscreenInput) -> @location(0) vec4<f32> {
let color = textureSample(scene_texture, samp, in.uv);
let brightness = dot(color.rgb, vec3f(0.2126, 0.7152, 0.0722));
if brightness > 0.8 {
return color;
}
return vec4f(0.0, 0.0, 0.0, 1.0);
}
The dot with (0.2126, 0.7152, 0.0722) computes perceptual luminance — the human eye is most sensitive to green, then red, then blue.
Stage 2 — Gaussian blur: blur the thresholded image so bright spots become soft glows. Gaussian blur is separable — you can split a 2D blur into two 1D passes (horizontal then vertical), which is much faster:
Separable Gaussian Blur
=======================
Bright Horizontal Vertical Blurred
pixels ──► blur pass ──► blur pass ──► result
(1D, left (1D, up
to right) to down)
A 9x9 2D kernel = 81 samples per pixel
Two 9-wide 1D kernels = 18 samples per pixel
Same result, 4.5x faster!
A single-direction blur shader samples several neighbouring texels with Gaussian weights:
@fragment
fn blur_horizontal(in: FullscreenInput) -> @location(0) vec4<f32> {
let texel_size = 1.0 / f32(textureDimensions(source).x);
var result = vec4f(0.0);
// Gaussian weights for a 5-tap kernel (9 texels total with mirroring).
// Declared as `var` because some WGSL implementations reject
// dynamically indexing a `let`-bound array.
var weights = array<f32, 5>(0.227, 0.194, 0.122, 0.054, 0.016);
var offsets = array<f32, 5>(0.0, 1.0, 2.0, 3.0, 4.0);
for (var i = 0u; i < 5u; i = i + 1u) {
let offset = vec2f(offsets[i] * texel_size, 0.0);
result += textureSample(source, samp, in.uv + offset) * weights[i];
if i > 0u {
result += textureSample(source, samp, in.uv - offset) * weights[i];
}
}
return result;
}
Stage 3 — Composite: add the blurred bright areas back onto the original scene:
@fragment
fn composite(in: FullscreenInput) -> @location(0) vec4<f32> {
let scene = textureSample(scene_texture, samp, in.uv);
let bloom = textureSample(bloom_texture, samp, in.uv);
return scene + bloom * bloom_intensity;
}
Other post-processing effects
The render-to-texture pattern enables many effects:
- Colour grading: adjust contrast, saturation, colour curves
- Vignette: darken the edges of the screen
- Chromatic aberration: split RGB channels with slight offsets
- Motion blur: blend the current frame with previous frames
- Depth of field: blur based on distance from a focal point (requires a depth buffer)
- Screen-space ambient occlusion (SSAO): approximate indirect shadows
Each effect is a fragment shader running on a full-screen quad, sampling from the previous pass’s texture.
Key takeaway: post-processing effects are implemented as multi-pass rendering. You render the scene to an off-screen texture, then process it through one or more full-screen fragment shader passes. Bloom is a classic example: threshold bright pixels, blur them with separable Gaussian passes, and composite the glow back onto the original. This pattern is the backbone of modern real-time visual effects.
17. Signed Distance Fields for font rendering
Rendering crisp text at any size and rotation is surprisingly difficult with traditional bitmap fonts. Signed Distance Fields (SDFs) provide an elegant solution that gives resolution-independent, anti-aliased text with a single texture.
The problem with bitmap fonts
A bitmap font is a texture where each character is stored as a grid of pixels:
Bitmap "A" at 32px: Zoomed in (pixelated):
┌────────────┐ ┌──┬──┬──┬──┬──┬──┐
│ ██ │ │ │ │██│██│ │ │
│ █ █ │ │ │██│ │ │██│ │
│ ██████ │ │██│██│██│██│██│██│
│ █ █ │ │██│ │ │ │ │██│
│ █ █ │ │██│ │ │ │ │██│
└────────────┘ └──┴──┴──┴──┴──┴──┘
Looks fine at 32px. Looks blocky at 128px.
If you scale the bitmap up, it becomes pixelated. If you scale it down, details are lost. You would need multiple texture sizes, wasting memory.
What is a Signed Distance Field?
An SDF stores, for each texel, the distance to the nearest edge of the shape. Texels inside the shape have negative distances; texels outside have positive distances. The zero-crossing is the exact edge.
SDF for a circle:
+3 +2 +1 0 -1 -2 -1 0 +1 +2 +3
+2 +1 0 -1 -2 -3 -2 -1 0 +1 +2
+1 0 -1 -2 -3 -4 -3 -2 -1 0 +1
0 -1 -2 -3 -4 -5 -4 -3 -2 -1 0
+1 0 -1 -2 -3 -4 -3 -2 -1 0 +1
+2 +1 0 -1 -2 -3 -2 -1 0 +1 +2
+3 +2 +1 0 -1 -2 -1 0 +1 +2 +3
← 0 is the edge. Negative = inside. Positive = outside.
The key insight is that this distance information encodes the shape at any resolution. To render, you check which side of the edge the sampled distance falls on. In practice, the signed distance is baked into a texture remapped to the [0, 1] range, conventionally with 0.5 at the edge and values above 0.5 inside; the shaders below use that convention.
The smoothstep trick
Hard thresholding (inside vs outside) gives jagged edges. The smoothstep function provides perfect anti-aliasing by creating a smooth transition in a narrow band around the edge:
@fragment
fn sdf_text(in: VertexOutput) -> @location(0) vec4<f32> {
// Sample the SDF texture — value is distance to edge
let distance = textureSample(sdf_texture, samp, in.uv).r;
// smoothstep creates a smooth transition near the edge
// 0.5 is the edge; the range (0.45, 0.55) is the anti-alias band
let alpha = smoothstep(0.45, 0.55, distance);
return vec4f(text_color.rgb, alpha);
}
smoothstep visualised:
alpha
1.0                     ╭───────────────────
                       ╱
                      ╱  ← smooth transition
                     ╱     (anti-aliased edge)
0.0 ────────────────╯
    outside        edge        inside
          0.45      0.5      0.55
The width of the transition band can be adjusted. A narrower band gives sharper text; a wider band gives softer text. You can even compute the band width based on the rate of change of the UV coordinates (using fwidth) to get pixel-perfect anti-aliasing at any scale:
let distance = textureSample(sdf_texture, samp, in.uv).r;
let edge = 0.5;
let aa_width = fwidth(distance) * 0.75;
let alpha = smoothstep(edge - aa_width, edge + aa_width, distance);
Advantages of SDF text
- Resolution-independent: one small texture (e.g., 64x64 per glyph) looks crisp at any display size
- Cheap anti-aliasing: just smoothstep — no multisampling needed
- Effects for free: outlines, drop shadows, and glow are trivial to add by adjusting the distance threshold:
// Outline effect
let outline_alpha = smoothstep(0.35, 0.40, distance); // outer edge of outline
let fill_alpha = smoothstep(0.45, 0.55, distance); // inner fill
let color = mix(outline_color, fill_color, fill_alpha);
let alpha = outline_alpha;
SDF effects by varying the threshold:
┌───────────────────────────────────────┐
│ dist < 0.35   → outside (transparent) │
│ 0.35 to 0.45  → outline               │
│ dist > 0.45   → fill (solid text)     │
└───────────────────────────────────────┘
Generating SDF textures
SDF textures are typically pre-generated offline. Tools include:
- msdfgen: generates multi-channel SDFs for even sharper edges
- Hiero (LibGDX): generates SDF font atlases
- fontdue (Rust crate): can generate SDF glyph bitmaps
The generated SDF texture is a single-channel (greyscale) image where 0.5 represents the edge, values above 0.5 are inside the glyph, and values below 0.5 are outside.
Key takeaway: signed distance fields store the distance to a shape’s edge at each texel. This allows rendering crisp, anti-aliased shapes at any resolution from a small texture. The smoothstep function provides the anti-aliasing, and varying the distance threshold enables outlines, glows, and shadows. SDF-based text rendering is used in game engines, mapping applications, and anywhere resolution-independent text is needed.
18. Resources: Learn WGPU, Shadertoy, The Book of Shaders
This section collects the best resources for continuing your shader programming journey. Each resource approaches the topic from a different angle — use them together for a well-rounded education.
Tutorials and courses
Learn WGPU — the definitive tutorial for wgpu in Rust. It walks through window setup, textures, camera systems, lighting, instancing, and more, with complete working code at each step. If you want to build on the exercises in this course, this is the natural next step.
The Book of Shaders by Patricio Gonzalez Vivo and Jen Lowe — a gentle, visual introduction to fragment shaders. It uses GLSL (not WGSL), but the concepts translate directly: noise functions, patterns, colour mixing, shapes, and animation. The interactive editor lets you experiment in real time. Excellent for building shader intuition.
GPU Gems — NVIDIA’s classic book series (available free online). Covers advanced topics like water rendering, subsurface scattering, shadow techniques, and GPU physics. The techniques are presented in HLSL/GLSL but the algorithms are API-agnostic.
WebGPU Fundamentals — explains WebGPU concepts from the ground up with JavaScript examples. Since wgpu implements the WebGPU spec, the API concepts map directly to Rust. Useful for understanding the “why” behind API design decisions.
Interactive playgrounds
Shadertoy — a web-based shader playground where you write fragment shaders (GLSL) and see results immediately. The community has created incredible effects: raymarched landscapes, fluid simulations, fractal zooms, entire games. Study other people’s shaders to learn techniques — the compact format forces creative solutions. You can port Shadertoy ideas to WGSL in your wgpu projects.
WGSL Playground — Google’s Tour of WGSL. An interactive introduction to the WGSL language with runnable examples. Good for quickly testing WGSL syntax.
Specifications and references
WebGPU Specification — the official W3C specification that wgpu implements. Dense but authoritative. Useful when you need to understand exact behaviour.
WGSL Specification — the complete language specification for WGSL. Reference for built-in functions, types, memory models, and grammar.
wgpu documentation (docs.rs) — Rust API documentation for the wgpu crate. Essential reference for looking up function signatures, enum variants, and descriptor fields.
Advanced topics to explore
Once you are comfortable with the basics covered in this course, here are directions to explore:
- 3D rendering: model-view-projection matrices, depth buffers, camera systems
- Lighting: Phong, Blinn-Phong, physically-based rendering (PBR)
- Shadow mapping: rendering depth from light’s perspective, shadow comparison
- Instancing: drawing thousands of objects efficiently with a single draw call
- Raymarching: rendering 3D scenes using signed distance functions (no triangles)
- Procedural generation: noise functions (Perlin, Simplex) for terrain, textures, and clouds
- Deferred rendering: separating geometry and lighting into different passes
- Skeletal animation: vertex skinning with bone matrices
Community
- wgpu GitHub — the source code, issue tracker, and examples
- WebGPU Matrix channel — real-time chat with the wgpu developers and community
- r/rust_gamedev — Rust game development community on Reddit, where wgpu projects are frequently shared
Key takeaway: shader programming is a vast field. Start with Learn WGPU for Rust-specific guidance, The Book of Shaders for visual intuition, and Shadertoy for inspiration. Keep the WGSL spec and wgpu docs.rs handy as references. The GPU programming community is active and welcoming — share your work and learn from others.