Signals
5/10 Research 11 May 2026, 18:04 UTC

Subquadratic announces new selective attention architecture supporting a 12 million-token context window.

Pushing context windows to 12M tokens via subquadratic scaling would fundamentally change how we handle RAG and long-document processing. If the claimed performance on MRCR v2 holds up without severe degradation in retrieval accuracy, this architecture bypasses the memory bottlenecks of standard quadratic attention. The promised scaling to 50M tokens suggests a shift away from complex chunking heuristics toward true whole-corpus inference.

What Happened

Startup Subquadratic has claimed a major architectural breakthrough with the introduction of "Subquadratic Selective Attention." According to announcements circulating on X, the novel architecture supports a 12 million-token context window while outperforming existing approaches on benchmarks such as MRCR v2. The team has also stated plans to scale the architecture to support up to 50 million tokens.

Technical Details

Standard transformer architectures rely on self-attention mechanisms whose compute and memory requirements scale quadratically (O(N^2)) with sequence length, creating severe bottlenecks for long-context tasks. "Subquadratic Selective Attention" implies a fundamental algorithmic shift to subquadratic scaling, likely O(N log N) or even linear O(N) time complexity.
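
To make the bottleneck concrete, here is a back-of-the-envelope sketch (ours, not from the announcement) of how the raw attention score matrix grows with sequence length, assuming fp16 storage and a single head:

```python
# Illustrative only: size of one N x N attention score matrix in fp16.
# Real kernels (e.g. FlashAttention) avoid materializing this matrix,
# but the O(N^2) compute still remains.

def attention_matrix_bytes(n_tokens: int, bytes_per_entry: int = 2) -> int:
    """Bytes needed to store a full N x N score matrix for one head."""
    return n_tokens * n_tokens * bytes_per_entry

for n in (128_000, 1_000_000, 12_000_000):
    gib = attention_matrix_bytes(n) / 2**30
    print(f"{n:>12,} tokens -> {gib:>12,.0f} GiB per head")
```

At 12M tokens the naive score matrix alone would need roughly 268,000 GiB per head, which is why subquadratic scaling is a prerequisite rather than an optimization at this length.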

While the exact mechanics remain proprietary, achieving high scores on MRCR v2 at 12M tokens suggests the architecture successfully mitigates the "lost in the middle" degradation typically observed when stretching standard context windows via techniques like Rotary Position Embedding (RoPE) scaling or ring attention. It likely utilizes a form of dynamic sparsity or state-space routing to selectively attend to relevant tokens without computing the full attention matrix.
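
Since the mechanism is proprietary, the following is only a minimal sketch of the general dynamic-sparsity family: each query attends to its top-k keys chosen by a cheap low-rank proxy score, so the full-precision attention matrix is never formed. Note the proxy used here is itself still O(N^2); a genuinely subquadratic selector would replace it with hashing, clustering, or learned routing.

```python
import numpy as np

def selective_attention(q, k, v, top_k=64, proxy_dim=8, seed=0):
    """q, k, v: (N, d) arrays. Each query attends only to top_k selected keys."""
    n, d = q.shape
    rng = np.random.default_rng(seed)
    # Cheap low-rank proxy scores used only to pick candidate keys.
    # NOTE: this proxy is still O(N^2); a real subquadratic system would
    # swap it for LSH, clustering, or a learned router.
    proj = rng.standard_normal((d, proxy_dim)) / np.sqrt(d)
    proxy = (q @ proj) @ (k @ proj).T                         # (N, N), low cost
    idx = np.argpartition(proxy, -top_k, axis=1)[:, -top_k:]  # (N, top_k)

    out = np.empty_like(q)
    for i in range(n):
        ks, vs = k[idx[i]], v[idx[i]]          # attend to top_k keys/values only
        scores = q[i] @ ks.T / np.sqrt(d)
        w = np.exp(scores - scores.max())      # numerically stable softmax
        out[i] = (w / w.sum()) @ vs
    return out

rng = np.random.default_rng(1)
q, k, v = (rng.standard_normal((1024, 64)) for _ in range(3))
print(selective_attention(q, k, v).shape)  # (1024, 64)
```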

Why It Matters

For engineering teams, a reliable 12M+ token context window would radically alter data ingestion pipelines. It could eliminate the need for complex Retrieval-Augmented Generation (RAG) architectures, vector databases, and intricate text-chunking heuristics. Instead of retrieving isolated semantic snippets, entire codebases, massive legal corpora, or years of financial data could be loaded directly into the prompt for holistic reasoning. Furthermore, as market commentators noted alongside this news, a shift toward mega-context models would heavily drive demand for next-generation physical AI infrastructure to supply the required memory bandwidth and compute density.
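
As a sketch of what that simplification could look like in practice, the snippet below contrasts the two ingestion styles. The model name, token limit, and generate call are hypothetical placeholders, not a released API:

```python
from pathlib import Path

CLAIMED_WINDOW = 12_000_000  # announced limit; unverified

def estimate_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per token for English text and code."""
    return len(text) // 4

def load_corpus(root: str) -> str:
    """Concatenate an entire repo into one prompt body, with no chunking."""
    return "\n\n".join(
        f"### {path}\n{path.read_text(errors='ignore')}"
        for path in sorted(Path(root).rglob("*.py"))
    )

# Before (RAG): chunk -> embed -> vector store -> retrieve top-k -> prompt.
# After (mega-context): load everything, subject only to the token budget.
corpus = load_corpus("./my_repo")
assert estimate_tokens(corpus) < CLAIMED_WINDOW
prompt = corpus + "\n\nQuestion: where is the authentication flow implemented?"
# response = client.generate(model="subquadratic-12m", prompt=prompt)  # hypothetical API
```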

What to Watch Next

The immediate priority is verifying these claims through independent reproduction and third-party benchmarking. Engineers should watch for the release of technical papers detailing the exact time and space complexity of the attention mechanism, as well as needle-in-a-haystack retrieval accuracy across the full 12M context. Additionally, monitor their progress toward the 50M token milestone and look out for any model weight or API releases to test real-world inference latency and throughput.
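
For teams planning their own verification, a harness along the following lines could probe retrieval accuracy across depths and context sizes once anything ships; query_model stands in for whatever API or weights are eventually released:

```python
def build_haystack(n_words: int, needle: str, depth: float) -> str:
    """Filler text with `needle` inserted at a relative depth in [0, 1].
    Word count stands in for token count; a real harness would tokenize."""
    filler = "The quick brown fox jumps over the lazy dog. "
    words = (filler * (n_words // 9 + 1)).split()[:n_words]
    words.insert(int(depth * len(words)), needle)
    return " ".join(words)

def run_sweep(query_model, sizes=(1_000_000, 6_000_000, 12_000_000)):
    """Needle-in-a-haystack sweep over context size and needle depth."""
    needle = "The magic number is 48151623."
    for n in sizes:
        for depth in (0.0, 0.25, 0.5, 0.75, 1.0):  # probe "lost in the middle"
            prompt = build_haystack(n, needle, depth) + "\nWhat is the magic number?"
            found = "48151623" in query_model(prompt)
            print(f"{n:>10,} tokens, depth {depth:.2f}: {'pass' if found else 'FAIL'}")
```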

architecture context-window attention-mechanism research llm