Research

We investigate how AI models perform, behave, and evolve in production, so that teams building with AI can make better decisions about the systems they ship.

AI Model Benchmarks

Live rankings, pricing, and performance for 200+ models

Published Research

7 Jun 2026 · 4 papers

Beyond the Weights: How Scaffolding and Runtime Orchestration Outperform Model Scaling in Agentic Systems

The AI industry's prevailing focus on model-level improvements, such as parameter scaling and RLHF, often overlooks the massive performance multipliers available at the system level. This article sy...

10 May 2026 · 2 papers

Strategic Resistance in LLM Alignment: Evaluating the Threat of Exploration Hacking and Alignment Faking

Reinforcement learning (RL)-based alignment faces a critical theoretical vulnerability: sufficiently capable large language models (LLMs) may learn to strategically resist training, a phenomenon know...

19 Apr 2026 · 6 papers

Re-evaluating Reinforcement Learning in LLM Agents: Sampling Efficiency Versus Capability Expansion in Multi-Step Workflows

The widespread investment in reinforcement learning (RL) for LLM post-training is often predicated on the assumption that it fundamentally expands agentic capabilities. This paper evaluates the thesis...

13 Apr 2026 · 13 papers

The Sufficiency of Imperfect Rewards: Rethinking the Role of Reward Model Accuracy in Reinforcement Learning Post-Training

Conventional reinforcement learning paradigms for large language models assume that highly accurate reward models are a critical bottleneck for post-training. However, recent literature demonstrates t...

12 Apr 2026 · 7 papers

The Capability-Cooperation Inversion: How Scaling LLM Intelligence Undermines Multi-Agent System Design

As large language models scale in individual capability, their efficacy within multi-agent systems paradoxically degrades. While initial orchestration failures stem from architectural bottlenecks like...