Question

Attention is O(n²) in memory. Explain concretely what that means at 100k tokens, why it makes naive attention infeasible, and how Flash Attention solves it without changing the math. Then explain what other approaches exist for truly long contexts.

Accepted Answer

The O(n²) memory problem with real numbers

Standard attention computes softmax(QK^T / √d_k) · V. The intermediate QK^T matrix has shape [n, n]. At 100k tokens:

100,000 × 100,000 × 2 bytes (float16) = 20 GB per head

A transformer with 32 heads would need 640 GB just for the attention matrices of one layer, far more than any single GPU holds. Even at 8k tokens (GPT-4's context at launch), one head's attention matrix is ~128 MB, or ~4 GB for a 32-head layer. At batch size 8 that is ~1 GB per head and ~32 GB per layer. Training keeps these matrices for the backward pass in every layer, so memory, not compute, becomes the limit.
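The arithmetic above is easy to reproduce. A minimal sketch, where `attention_matrix_bytes` is an illustrative helper (not from any library) and float16 storage is assumed:

```python
def attention_matrix_bytes(n_tokens, n_heads=1, batch_size=1, bytes_per_element=2):
    """Bytes needed to materialize the full [n, n] score matrix (float16 by default)."""
    return n_tokens * n_tokens * bytes_per_element * n_heads * batch_size


GB = 10**9    # decimal gigabytes, matching the 20 GB figure in the text
MiB = 2**20

print(attention_matrix_bytes(100_000) / GB)               # one head at 100k tokens
print(attention_matrix_bytes(100_000, n_heads=32) / GB)   # one 32-head layer at 100k tokens
print(attention_matrix_bytes(8_192) / MiB)                # one head at 8k tokens
print(attention_matrix_bytes(8_192, batch_size=8) / MiB)  # one head at 8k tokens, batch size 8
```

The quadratic term is what matters: going from 8k to 100k tokens is about 12× more tokens but about 150× more memory per head.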

Why naive attention is slow: HBM vs SRAM

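GPU memory has two tiers (the figures below are the A100 numbers cited in the Flash Attention paper):
- HBM (High Bandwidth Memory): large capacity (40-80 GB), 1.5-2.0 TB/s bandwidth. An H100 SXM has 80 GB at about 3.35 TB/s.
- SRAM (on-chip shared memory): tiny capacity (192 KB per streaming multiprocessor, about 20 MB in total), ~19 TB/s bandwidth, roughly 10× faster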

Naive attention writes the full n×n matrix to HBM, reads it back for softmax, writes softmax output to HBM, reads it back for the weighted sum. These HBM round-trips dominate wall-clock time.
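To get a feel for the cost of those round-trips, here is a rough back-of-the-envelope sketch. It counts only the four passes over the n×n intermediates listed above and ignores Q, K, V and output traffic; the function name and the 3 TB/s example bandwidth (an H100-class figure) are assumptions for illustration:

```python
def naive_attention_hbm_seconds(n_tokens, hbm_bytes_per_s, bytes_per_element=2):
    """Rough time spent only on moving the [n, n] intermediates through HBM.

    Counts four passes: write scores, read scores, write softmax output,
    read softmax output. Traffic for Q, K, V and the output is O(n * d)
    and is ignored here because it is small by comparison for large n.
    """
    matrix_bytes = n_tokens * n_tokens * bytes_per_element
    passes = 4
    return passes * matrix_bytes / hbm_bytes_per_s


# One head at 100k tokens with an assumed 3 TB/s of HBM bandwidth:
# about 27 ms of pure data movement, before any arithmetic is counted.
print(naive_attention_hbm_seconds(100_000, hbm_bytes_per_s=3e12))
```

This is a lower bound on the memory traffic, not a benchmark; real kernels also pay for launch overhead and imperfect bandwidth utilization.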

Flash Attention: same math, no HBM round-trips

Flash Attention (Dao et al., 2022) reorders the computation with tiling, processing attention in small blocks that fit in SRAM:

1. Load small tiles of Q, K, V into SRAM
2. Compute partial QK^T within the tile
3. Use online softmax: maintain a running max and normalizer to update the softmax incrementally as new tiles arrive (no need to see all scores first)
4. Accumulate the weighted V sum directly
5. Never materialize the full n×n matrix in HBM

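The result is mathematically identical to standard attention (up to floating-point rounding), not an approximation. Extra memory drops from O(n²) to O(n), because only the running max and normalizer per query are kept; the backward pass recomputes score tiles instead of reading a stored matrix. Reported results: ~10× less attention memory at 2k tokens (the saving grows with sequence length) and 2-4× wall-clock speedup.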
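The online softmax step is the part that is easiest to see in code. The following is a minimal pure-Python sketch of the idea, not the actual kernel: real Flash Attention also tiles over queries and runs as a fused GPU kernel, while this version loops over queries one at a time and only tiles K and V. The function names and the `block` parameter are illustrative.

```python
import math
import random


def naive_attention(Q, K, V):
    """Reference: computes a full row of scores before applying softmax."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out


def tiled_attention(Q, K, V, block=4):
    """Processes K/V in blocks, keeping only a running max, normalizer and output per query."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    d_v = len(V[0])
    out = []
    for q in Q:
        run_max = -math.inf   # largest score seen so far
        run_sum = 0.0         # softmax normalizer, expressed relative to run_max
        acc = [0.0] * d_v     # unnormalized weighted sum of V rows
        for start in range(0, len(K), block):
            k_blk = K[start:start + block]
            v_blk = V[start:start + block]
            scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
            new_max = max(run_max, max(scores))
            # Rescale what was accumulated under the old maximum.
            correction = math.exp(run_max - new_max)
            run_sum *= correction
            acc = [a * correction for a in acc]
            for s, v in zip(scores, v_blk):
                w = math.exp(s - new_max)
                run_sum += w
                for j in range(d_v):
                    acc[j] += w * v[j]
            run_max = new_max
        out.append([a / run_sum for a in acc])
    return out


random.seed(0)
n, d = 10, 8
Q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
V = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

ref = naive_attention(Q, K, V)
tiled = tiled_attention(Q, K, V, block=4)
max_diff = max(abs(a - b) for r, t in zip(ref, tiled) for a, b in zip(r, t))
print(max_diff)  # expected to be at floating-point rounding level
```

The tiled version never holds more than `block` scores at once, yet it should agree with the reference up to rounding, which is the sense in which the math is unchanged.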

Flash Attention 2 added better parallelism across the sequence dimension and reduced non-matmul FLOPs. Flash Attention 3 targets H100 FP8 and asynchronous execution.

Is Flash Attention enough for 1M tokens?

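Flash Attention is IO-efficient, but the FLOPs are still O(n²). At 1M tokens:
1M × 1M = 10^12 query-key pairs per head per layer

Each pair costs d_k multiply-adds for QK^T and another d_k for the weighted sum with V. Assuming d_k = 128, that is about 5 × 10^14 FLOPs per head. At H100 peak dense BF16 throughput (~1000 TFLOP/s), this is roughly 0.5 seconds per head, or ~16 seconds for a 32-head layer even at perfect utilization. Flash Attention solves the memory problem, not the compute problem at extreme context lengths.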
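The compute estimate depends heavily on head dimension and head count, which the text above does not pin down. A minimal sketch, assuming a head dimension of 128, 32 heads, and a round 10^15 FLOP/s peak (the function names are illustrative):

```python
def attention_flops(n_tokens, head_dim, n_heads, causal=False):
    """FLOPs for QK^T plus the weighted sum with V in one layer (softmax cost ignored)."""
    # Each of the two matmuls does n * n * head_dim multiply-adds per head,
    # and one multiply-add counts as 2 FLOPs.
    flops = 2 * 2 * n_tokens * n_tokens * head_dim * n_heads
    # Causal masking skips roughly half of the query-key pairs.
    return flops // 2 if causal else flops


def seconds_at_peak(flops, peak_flops_per_s=1e15):
    """Lower bound on runtime: assumes 100% utilization, which real kernels do not reach."""
    return flops / peak_flops_per_s


per_head = attention_flops(1_000_000, head_dim=128, n_heads=1)
per_layer = attention_flops(1_000_000, head_dim=128, n_heads=32)
print(seconds_at_peak(per_head))   # about 0.5 s for one head
print(seconds_at_peak(per_layer))  # about 16 s for a 32-head layer
```

Under these assumptions a single layer already takes seconds, and a model has dozens of layers, so exact attention at 1M tokens on one GPU is compute-limited regardless of how memory is handled.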

Approaches for truly long context

Sparse attention: each token attends to a structured subset of positions, such as local windows plus global tokens (Longformer), strided patterns, or learned sparse patterns. Cost is O(n·k), where k is the number of positions each token attends to.

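Sliding window attention (Mistral 7B, Gemma 2): each token attends only to a fixed window of w nearby tokens. Cost is O(n·w), which is fast, but global information propagates slowly: each layer moves information at most about w positions, so reaching a token d positions away takes roughly d/w layers.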
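A small sketch makes the trade-off concrete. It assumes a causal window in which a token attends to itself and the `window - 1` tokens before it; conventions differ between models, and all function names here are illustrative:

```python
def sliding_window_keys(position, window):
    """Key positions a causal sliding-window query at `position` may attend to."""
    start = max(0, position - window + 1)
    return range(start, position + 1)


def attended_pairs(n_tokens, window=None):
    """Number of (query, key) pairs scored under causal attention."""
    if window is None:
        # Full causal attention: token p attends to p + 1 positions.
        return n_tokens * (n_tokens + 1) // 2
    return sum(len(sliding_window_keys(p, window)) for p in range(n_tokens))


def layers_to_reach(distance, window):
    """Minimum number of stacked layers before information can travel `distance` tokens back."""
    if window < 2:
        raise ValueError("window must be at least 2 for information to move")
    # Each layer lets information hop at most (window - 1) positions back.
    return -(-distance // (window - 1))  # ceiling division


print(list(sliding_window_keys(10, 4)))
print(attended_pairs(1_000), attended_pairs(1_000, window=64))
print(layers_to_reach(100_000, 4_096))
```

The pair count grows linearly with sequence length once the window is fixed, while the number of layers needed to connect distant tokens grows with distance divided by window size.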

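Ring Attention: distribute the sequence across GPUs; each GPU processes its own chunk of queries and passes its KV chunk to the next GPU in a ring until every GPU has seen every chunk. Each GPU holds only a chunk at a time, so the maximum sequence length grows roughly linearly with GPU count, and the KV transfers can overlap with computation. Total compute is still O(n²).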
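The communication pattern can be sketched without any GPU code. The helper below is hypothetical and only models which KV chunk each device holds at each step, assuming device i starts with chunk i and always sends to device i + 1:

```python
def ring_schedule(num_devices):
    """For each step, list which KV chunk every device holds.

    Device i starts with its own chunk i. After each step every device
    passes its current chunk to device (i + 1) % num_devices.
    """
    schedule = []
    for step in range(num_devices):
        schedule.append([(device - step) % num_devices
                         for device in range(num_devices)])
    return schedule


for step, holdings in enumerate(ring_schedule(4)):
    print(step, holdings)
```

After as many steps as there are devices, every device has seen every KV chunk exactly once. In an actual implementation each device would fold the visiting chunk into its local output with the same running max and normalizer that blockwise attention uses.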

State space models (Mamba): replace attention with recurrent state updates. O(n) in both compute and memory. The past is compressed into a fixed-size state, so the model can't look up arbitrary past positions exactly, but it scales linearly. Hybrid architectures mix attention layers with SSM layers.
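The linear scaling comes from the recurrence itself. Below is a deliberately simplified sketch: a scalar linear recurrence with fixed parameters `a`, `b`, `c` chosen for illustration. It is not Mamba, which makes these parameters depend on the input and uses a hardware-aware parallel scan.

```python
def linear_ssm_scan(xs, a=0.9, b=1.0, c=1.0):
    """Scalar linear recurrence: h_t = a * h_{t-1} + b * x_t, y_t = c * h_t."""
    h = 0.0   # the entire memory of the past is this one number
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys


# An impulse at t = 0 decays geometrically: the model only sees the past
# through the state, never through a lookup of earlier inputs.
print(linear_ssm_scan([1.0, 0.0, 0.0, 0.0], a=0.5))
```

Each step does constant work and keeps constant state, so cost is O(n) in sequence length, at the price of a lossy summary of earlier tokens.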

RoPE scaling for context extension

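Models trained with RoPE at a 4k context length degrade at 8k, because positions beyond 4k produce rotations the model never saw in training. Context extension techniques (positional interpolation, YaRN) rescale positions or RoPE frequencies so that longer sequences map back into the trained range, extending the effective context with a short fine-tune instead of full retraining. These techniques are now standard in long-context model releases.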
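Positional interpolation is the simplest of these to write down. A minimal sketch, assuming the common RoPE formulation with base 10000 and one angle per pair of dimensions; `rope_angles` and its `scale` parameter are illustrative names, and YaRN's per-frequency scaling is not shown:

```python
def rope_angles(position, head_dim, base=10000.0, scale=1.0):
    """Rotation angle for each pair of dimensions at a given position.

    scale > 1 applies positional interpolation: positions are divided by
    `scale`, so a longer sequence maps back into the trained position range.
    """
    return [(position / scale) * base ** (-2 * i / head_dim)
            for i in range(head_dim // 2)]


# With scale = 2, position 8192 gets the angles that position 4096 had
# during training at a 4k context length.
extended = rope_angles(8192, head_dim=64, scale=2.0)
trained = rope_angles(4096, head_dim=64)
print(max(abs(a - b) for a, b in zip(extended, trained)))
```

The cost of this uniform squeeze is that neighbouring tokens end up closer together in angle than in training, which is one reason a short fine-tune usually follows and why methods such as YaRN treat frequency bands differently.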