The standard playbook for making AI smarter boils down to one idea: feed it more text. But throwing more words at the model creates three compounding problems.
The more text in the window, the harder it is for the model to find the one fact it actually needs. Modern LLMs now process hundreds of thousands of words at once, but relevance gets buried under sheer volume.
Human language is naturally fuzzy. "Bank" can mean a riverbank or a financial institution. In the model's internal math, these meanings blur together, creating noise that compounds at scale. The paper calls this Adversarial Polysemy: words actively fighting each other for space in the model's limited dimensions.
Processing twice as many words costs roughly four times the compute (it grows quadratically, O(N²)). Even the fastest custom hardware, like NVIDIA's B200 running FlashAttention-4 at 1,605 TFLOP/s, can only optimise the speed of this process. It can't change the underlying math.
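To make the quadratic term concrete, here is a minimal NumPy sketch of naive attention (shapes and values are illustrative, not from the paper). The N×N score matrix is the culprit: double the tokens, quadruple the entries.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Naive softmax attention: materialises the full N x N score
    matrix, so compute and memory both scale as O(N^2)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # (N, N): the quadratic term
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # (N, d)

# Doubling the sequence length quadruples the score entries: (2N)^2 = 4N^2.
for N in (1_000, 2_000):
    print(f"{N} tokens -> {N * N:,} score entries")
```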
In standard transformer attention, retrieval probability is governed by the softmax partition function: O = softmax(QKᵀ / √d) V. For reliable retrieval (P_sig > 1 − ε), the target signal must exceed the Critical Energy Gap:

ΔE = E_sig − Ē_noise > ln(N − 1) + ln((1 − ε)/ε)
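To get a feel for the barrier, here is a quick computation of the right-hand side at growing context lengths; the reliability target ε = 0.01 is an illustrative choice, not a figure from the paper.

```python
import math

eps = 0.01  # illustrative reliability target: P_sig > 0.99

for N in (1_000, 100_000, 1_000_000):
    barrier = math.log(N - 1) + math.log((1 - eps) / eps)
    print(f"N = {N:>9,}: signal must exceed mean noise by {barrier:.2f} nats")
```

The barrier itself grows only logarithmically, but the signal energy available to clear it is capped by the fixed hidden dimension, so every 10× of context eats another ln(10) ≈ 2.3 nats of margin.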
As sequence length N grows, the log-barrier ln(N−1) rises, forcing the signal to clear an ever-wider margin over the mean noise energy just to stay retrievable. Meanwhile, the model packs millions of concepts into a fixed hidden dimension (e.g. d_k = 12,288). The Welch Bound makes the consequence precise: once vectors outnumber available dimensions, perfect orthogonality is mathematically impossible and pairwise overlap has a hard floor, so word embeddings inevitably bleed into each other and inflate Ē_noise. This geometric crowding is what the paper terms Adversarial Polysemy.
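The bound in question: for M unit vectors in d dimensions with M > d, the worst-case pairwise |cosine similarity| is at least √((M−d)/(d(M−1))). A minimal sketch, with concept counts chosen for illustration:

```python
import math

def welch_bound(M, d):
    """Welch lower bound on the maximum pairwise |cosine similarity|
    among M unit vectors in d dimensions (nonzero as soon as M > d)."""
    return math.sqrt((M - d) / (d * (M - 1)))

d = 12_288  # hidden dimension cited above
for M in (12_288, 100_000, 10_000_000):
    print(f"{M:>10,} vectors in {d} dims: overlap >= {welch_bound(M, d):.5f}")
```

As M grows, the bound approaches √(1/d) ≈ 0.009: tiny per pair, but it is exactly this unavoidable residue, summed over every distractor token in the window, that inflates Ē_noise.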
Empirical measurements on Meta-Llama-3-8B show English BPE tokens have a Participation Ratio of 94.34 (highly diffuse) vs. 31.93 for topo-categorical clusters (concentrated). This validates the geometric argument that natural language representations occupy a noisier, more spread-out region of embedding space.
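Participation Ratio is a standard effective-dimensionality measure: PR = (Σλᵢ)² / Σλᵢ² over the eigenvalues of the embedding covariance, ranging from 1 (one dominant direction) to d (perfectly isotropic). A sketch of the computation, with synthetic data standing in for real token embeddings:

```python
import numpy as np

def participation_ratio(X):
    """PR = (sum of eigenvalues)^2 / (sum of squared eigenvalues) of the
    covariance of X: near d for a diffuse cloud, small when variance
    concentrates in a few directions."""
    eig = np.linalg.eigvalsh(np.cov(X, rowvar=False))
    eig = np.clip(eig, 0.0, None)  # guard tiny negative values from rounding
    return eig.sum() ** 2 / (eig ** 2).sum()

rng = np.random.default_rng(0)
diffuse = rng.normal(size=(5_000, 128))        # isotropic: PR near 128
clustered = diffuse * 0.9 ** np.arange(128)    # decaying variance: low PR
print(f"diffuse PR:   {participation_ratio(diffuse):.1f}")
print(f"clustered PR: {participation_ratio(clustered):.1f}")
```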
The FlashAttention lineage addresses the IO bottleneck of self-attention (the online-softmax core all four generations share is sketched after the list):
FA-1/2: Loop tiling avoids O(N²) matrix materialisation in HBM.
FA-3: Exploits Hopper (H100) async pipelines + FP8 block quantisation (RMSE drops from 3.2e-4 to 1.9e-4 in FP16).
FA-4: On Blackwell B200, shifts to integer ALUs for Taylor-series exponential approximation, achieving 1,605 TFLOP/s.
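That shared core is the online (streaming) softmax: process the keys in tiles while carrying a running max, normaliser, and output accumulator, so the N×N score matrix never exists in memory at once. A minimal single-query NumPy sketch of the idea (function name and shapes are mine; real kernels also tile over query blocks and run fused in SRAM):

```python
import numpy as np

def online_softmax_attention(q, K, V, tile=256):
    """One query's attention output, computed tile-by-tile over the keys.
    Carries a running max (m), normaliser (l) and output accumulator (o),
    rescaling them by exp(m - m_new) as each new tile streams through."""
    d = q.shape[-1]
    m, l = -np.inf, 0.0
    o = np.zeros(V.shape[-1])
    for start in range(0, K.shape[0], tile):
        k_blk, v_blk = K[start:start + tile], V[start:start + tile]
        s = k_blk @ q / np.sqrt(d)       # scores for this tile only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)        # rescales all previously seen tiles
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        o = o * scale + p @ v_blk
        m = m_new
    return o / l

rng = np.random.default_rng(0)
K, V = rng.normal(size=(1024, 64)), rng.normal(size=(1024, 64))
q = rng.normal(size=64)
s = K @ q / np.sqrt(64)
w = np.exp(s - s.max()); w /= w.sum()
assert np.allclose(online_softmax_attention(q, K, V), w @ V)
```

The exp(m − m_new) rescaling is what lets earlier tiles be folded in exactly, which is why the streamed result matches the naive full-matrix computation to numerical precision while never allocating more than one tile of scores.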
Despite these feats, processing 100K–1M tokens still requires linear memory and quadratic FLOPs. FlashAttention-4 is the pinnacle of optimising the symptom, not a cure for softmax partition function inflation.