About the Challenge
Attention Kernel Challenge
Build the fastest numerically faithful block-sparse attention backend for H100.
A kernel is a small program that runs directly on the GPU. In modern ML systems, a surprising amount of end-to-end performance comes down to a handful of kernels that sit in the hot path. When one of them gets faster, the improvement shows up everywhere: lower latency, higher throughput, and cheaper training and inference.
Attention is one of the most important of those hot paths. The cost of full attention grows quadratically with sequence length, so real systems often rely on sparse patterns that preserve the interactions they need while skipping the rest. Sliding windows, global tokens, sink tokens, and retrieval blocks are all ways to stretch context length without paying the full dense cost on every token.
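To make those patterns concrete, here is a minimal sketch (not the challenge's official format) of how a block-sparse layout might be described at block granularity: a boolean matrix saying which query blocks attend to which key blocks, combining a causal sliding window with a few leading global blocks. All names and parameters here are illustrative.

```python
import numpy as np

def block_sparse_mask(num_blocks: int, window: int, num_global: int) -> np.ndarray:
    """Boolean (num_blocks x num_blocks) mask at block granularity.

    True means the query block attends to the key block. Combines:
      - causality (never attend to future blocks),
      - a sliding window over the most recent `window` blocks,
      - `num_global` leading global/sink blocks visible to every query.
    Illustrative only; the challenge repo defines the real contract.
    """
    q = np.arange(num_blocks)[:, None]   # query block index, as a column
    k = np.arange(num_blocks)[None, :]   # key block index, as a row
    causal = k <= q                       # no attention to future blocks
    local = (q - k) < window              # within the sliding window
    global_cols = k < num_global          # global/sink blocks, seen by all
    return causal & (local | global_cols)

mask = block_sparse_mask(num_blocks=6, window=2, num_global=1)
```

A kernel would then loop only over the `True` blocks for each query block, which is where the savings over dense attention come from.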
This challenge focuses on exact causal block-sparse attention on 1xH100 SXM 80GB. It rewards the kind of work good systems people like doing: memory layout, tiling, scheduling, launch configuration, and careful measurement. The target is narrow enough that progress is legible and open-ended enough that different engineering instincts can still matter.
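"Exact" means the kernel must reproduce what a straightforward implementation computes, not an approximation of it. A hedged sketch of such a correctness baseline — plain causal attention with a numerically stable softmax in float64 — is below; the challenge repository's harness is authoritative, and this function is only an illustration of the kind of reference one iterates against.

```python
import numpy as np

def causal_attention_ref(q, k, v):
    """Dense causal attention reference for a single head.

    q, k, v: (seq, dim) float arrays. Returns the (seq, dim) output.
    Written for clarity and numerical faithfulness, not speed.
    """
    seq, dim = q.shape
    scores = (q @ k.T) / np.sqrt(dim)
    # Causal mask: position i may attend only to positions j <= i.
    mask = np.tril(np.ones((seq, seq), dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    # Numerically stable softmax over the key axis.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Comparing a fast kernel's output against a reference like this, at tight tolerances, is what "numerically faithful" buys you: speedups that don't quietly change the model's answers.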
Most of the grinding can happen locally. You can iterate on correctness, rewrite kernels, and profile ideas without touching a remote cluster. For the most accurate evaluation, though, you should measure on H100 GPUs on a service like Modal. In practice each experiment costs around 5 cents, which is cheap enough to support fast iteration without turning the loop into guesswork.
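"Careful measurement" usually means the same basic loop everywhere: warm up, time many iterations, report a robust statistic. A minimal CPU-side skeleton of that loop is sketched below; on a GPU you would additionally synchronize around each timed call (for example with torch.cuda.synchronize or CUDA events), and the challenge harness defines the measurement that actually counts.

```python
import time
import statistics

def measure_ms(fn, warmup: int = 5, iters: int = 20) -> float:
    """Median wall-clock time of fn() in milliseconds.

    Warmup runs absorb one-time costs (JIT, caches, allocator);
    the median resists outlier iterations better than the mean.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1e3)
    return statistics.median(samples)
```

Using the median of many short runs, rather than a single timing, is what keeps a 5-cent experiment from being noise.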
The full technical contract, harness, and submission details live in the challenge repository on GitHub.