Radix Sort

GPU radix sort for u32 arrays

Initializing WebGPU...

Quick Start

Loading...

Source

Loading...

Documentation

Radix Sort

GPU-accelerated radix sort for unsigned 32-bit integers. Pure compute — no visual output. Significantly faster than CPU sorting for large arrays (100K+ elements).

Usage

import { createRadixSort } from './radix-sort';

const sorter = createRadixSort(device, { maxElements: 1_000_000 });

// Sort a buffer in-place
sorter.sort(myBuffer, elementCount);

// Clean up when done
sorter.destroy();

API

createRadixSort(device, options?)

Returns a RadixSort instance.

Option Type Default Description
maxElements number 1_000_000 Maximum number of elements (determines internal buffer sizes)

sorter.sort(buffer, elementCount)

Sorts a GPUBuffer of u32 values in-place (ascending order).

Param Type Description
buffer GPUBuffer Buffer containing u32 values. Must have STORAGE | COPY_SRC | COPY_DST usage.
elementCount number Number of elements to sort (not bytes)

sorter.sortPairs(keysBuffer, valuesBuffer, elementCount)

Sorts keys and reorders associated values to match. Currently sorts keys only — see "Extending to key-value sort" below for how to implement paired reordering.

sorter.destroy()

Releases all internal GPU buffers.

Algorithm

This implements a parallel radix sort processing 4 bits per pass (16 buckets), requiring 8 passes to sort 32-bit keys.

Each pass runs three compute kernels in sequence:

  1. Histogram — Each workgroup counts how many of its elements fall into each of the 16 buckets for the current 4-bit digit. Uses workgroup-shared atomics for fast local counting, then writes results to global memory.

  2. Prefix sum — An exclusive scan over the per-workgroup histograms computes global write offsets. The histogram is laid out as [bucket][workgroup], so scanning linearly preserves the relative order of elements within each bucket (stable sort).

  3. Scatter — Each element looks up its destination from the prefix sum and writes to the output buffer. Uses atomics to get unique positions for elements in the same bucket within a workgroup.

Ping-pong buffers alternate between passes to avoid copying.

WGSL loading

The default import uses Vite's ?raw suffix:

import shaderSource from './radix-sort.wgsl?raw';

If you're not using a bundler, load via fetch:

const shaderSource = await fetch(new URL('./radix-sort.wgsl', import.meta.url)).then((r) =>
  r.text()
);

Modifying

Sort f32 values instead of u32

Floating-point values need a bit transformation to sort correctly with radix sort. Before sorting, apply this transform to each key:

if (key >> 31) == 1:  key = ~key          // negative: flip all bits
else:                 key = key ^ 0x80000000  // positive: flip sign bit

Apply the inverse after sorting. You can add a pre-processing and post-processing compute pass, or modify the histogram and scatter kernels to apply the transform inline.

Change the radix width

The current implementation uses 4 bits (16 buckets, 8 passes). You could use:

  • 8 bits (256 buckets, 4 passes) — fewer passes but more shared memory and larger histograms
  • 2 bits (4 buckets, 16 passes) — less shared memory but more passes

Change RADIX_BITS and NUM_BUCKETS in the WGSL, and NUM_PASSES in the TypeScript.

Extending to key-value sort

To sort key-value pairs, add a second set of ping-pong buffers for values. In the scatter kernel, when writing output[dest] = element, also write values_output[dest] = values_input[idx]. This requires adding additional buffer bindings to the bind group.

Use as a building block

Radix sort is commonly used as a sub-routine in:

  • Particle systems — sort particles by depth for correct alpha blending
  • Spatial hashing — sort by cell index to group spatially close elements
  • GPU BVH construction — sort Morton codes for tree building
  • Prefix sum / stream compaction — the prefix sum kernel can be reused independently

Further Reading

Resources on GPU radix sort algorithms and parallel sorting techniques.

Core Papers

Prefix Sum (Scan)

WebGPU-Specific

General References

  • Cormen, Leiserson, Rivest, Stein, "Introduction to Algorithms" (CLRS) Chapter 8 covers radix sort in the sequential setting. Understanding the sequential algorithm helps reason about the parallel version's correctness.

  • Duane Merrill, "CUB Library" CUDA's high-performance primitives library, including the most optimized GPU radix sort implementation. A good reference for advanced optimization techniques. https://github.com/NVIDIA/cub

  • Vulkan / CUDA Radix Sort Benchmarks Practical performance comparisons across GPU sorting implementations, useful for understanding expected throughput. https://github.com/b0nes164/GPUSorting