Research

19 Apr 2026 · 5 papers analyzed

Re-evaluating Reinforcement Learning in LLM Agents: Sampling Efficiency Versus Capability Expansion in Multi-Step Workflows

The widespread investment in reinforcement learning (RL) for LLM post-training is often predicated on the assumption that it fundamentally expands agentic capabilities. This paper evaluates the thesis that RL merely optimizes sampling efficiency within fixed boundaries dictated by the base model, acting as a ceiling that only scaling can raise. By synthesizing recent literature, we find that while this thesis holds true for static reasoning and single-step retrieval tasks, rigorous statistical frameworks in dynamic, multi-step environments demonstrate that RL and targeted system design can genuinely synthesize new capabilities. Ultimately, we argue that characterizing RL strictly as a sampling optimizer is accurate for static contexts but overly reductive for complex, compositional agentic workflows.

Introduction: The RL Capability Debate

The artificial intelligence industry is currently obsessed with reinforcement learning (RL) for agentic post-training. The prevailing assumption driving massive compute investments is that RL fundamentally transforms a passive text generator into an autonomous agent capable of complex, multi-step reasoning. We challenge this assumption. Our central thesis is straightforward: reinforcement learning does not expand the fundamental capability boundary of LLM agents. Instead, it merely improves sampling efficiency within existing boundaries. By heavily relying on RL to build agentic workflows, the industry is largely optimizing within a fixed ceiling, one that only better base models can ultimately raise.

The financial and computational implications of this reality are severe. If RL strictly acts as a sampling optimizer, the massive compute budgets currently diverted toward complex agentic post-training pipelines are yielding diminishing returns. We are spending billions of dollars to help models find the most efficient path through a maze they already know how to solve, rather than expanding the actual size of the maze they can comprehend.

The strongest evidence for this capability ceiling emerges when we look closely at how models handle long-context reasoning and retrieval. When models fail at these tasks, it is rarely because they lack the underlying semantic comprehension. Whitecross et al. (2026) demonstrate this through their analysis of the "lost-in-thought" phenomenon, where a model's own reasoning tokens actively distract it from retrieving correct in-context information. Through explicit injection experiments, they prove that the base model already possesses the fundamental capability to identify the correct information, but that it fails to sample it during unconstrained generation, hallucinating plausible values instead. When the authors apply RL to train a recall policy, the RL does not grant the model new comprehension skills. It simply teaches the model the highly efficient, strategic policy of when to invoke a constrained-decoding crutch (Whitecross et al., 2026). In this case, RL strictly optimizes retrieval efficiency within an existing context boundary.

We observe a similar dynamic in complex procedural execution. State-of-the-art models exhibit a persistent "Execution Gap" where they accurately classify user intents but fail to execute the correct subsequent logical actions (Shi et al., 2026). When subjected to reinforcement learning from human feedback (RLHF), these models develop "Empathy Resilience"—maintaining a polite, conversational facade even as their underlying logical compliance severely degrades under adversarial pressure (Shi et al., 2026). RL successfully optimizes the superficial style and conversational efficiency of the agent, but it fails entirely to push the fundamental reasoning boundary required to solve the underlying logical task.

This hard capability ceiling manifests catastrophically as tasks scale in compositional complexity. Lee et al. (2026) show that the strongest frontier models can successfully perform initial semantic steps, such as hidden entity identification, with 86.1% accuracy. However, when forced to perform subsequent multi-hop property retrieval and mathematical computation, overall answer accuracy plummets to 20.1% (Lee et al., 2026). The models suffer a severe, monotonic decline in performance as the Compositional Complexity Index increases. Because these heavily RL-optimized models easily pass the first step but fail the deeper logical deductions, our analysis suggests that current agentic loops are merely maximizing efficiency up to a hard boundary dictated by the base model's inherent reasoning limits.

However, this strict thesis faces a critical counter-narrative when we move from static evaluation to dynamic, multi-step agent environments. While the evidence for a fixed capability ceiling in static tasks is rigorous, recent mathematical frameworks disentangling efficiency from capability suggest a different reality for tool use. Zhai et al. (2026) explicitly refute the fixed-ceiling hypothesis for multi-step workflows, using PASS@(k,T) analysis to demonstrate that RL genuinely expands the capability boundary beyond what a base model can achieve, even with infinite sampling.

This tension forms the narrative arc of our paper. To understand whether RL is truly a capability multiplier or just an expensive sampling trick, we must isolate the variables across different environments. After establishing the high stakes of this debate, we will first examine the strongest evidence supporting the fixed-ceiling thesis, beginning with a deep dive into static reasoning and retrieval tasks.

RL as a Sampling Optimizer: Evidence from Static and Retrieval Tasks

We observe a persistent misconception in the current agentic literature: the assumption that reinforcement learning (RL) post-training fundamentally expands a model's cognitive boundaries. Our analysis reveals a much narrower reality. In static reasoning and retrieval, RL acts primarily as an efficient sampling mechanism rather than a creator of new fundamental knowledge (Whitecross et al., 2026; Shi et al., 2026; Lee et al., 2026). The widespread investment in RL-based post-training for agentic tasks is actively optimizing a performance ceiling that only better base models can raise.

We see the clearest mechanical evidence of this dynamic in the lost-in-thought phenomenon. When language models engage in long chain-of-thought reasoning—a key driver of their performance—their ability to retrieve in-context information drastically degrades. The reasoning tokens themselves act as semantic distractors, increasing the context length with related but unhelpful tokens that mask the model's latent retrieval capabilities (Whitecross et al., 2026).

Injection experiments definitively prove that base models can identify correct information but fail to sample it without RL-driven constrained decoding policies. Whitecross et al. (2026) demonstrate that even when researchers inject the exact lexical prefix of a target key mid-generation, unconstrained models still frequently hallucinate the value. The degradation stems entirely from the model’s inability to faithfully copy information from context after reasoning, not from a failure to identify what to retrieve.

To bypass this sampling failure, researchers utilize RL to teach models a strategic policy of when to invoke external tools or explicit copying mechanisms. In the case of RecaLLM, RL and constrained decoding teach the model to explicitly copy necessary context into its reasoning trace. Removing this RL recall reward drops average performance from 71.3 to 65.5, proving that RL is critical for teaching the invocation policy rather than expanding the model's fundamental comprehension of the text (Whitecross et al., 2026). Because the base model already possesses the latent comprehension, the RL-trained recall policy generalizes far beyond its training data, improving performance on contexts up to 128K tokens despite being trained on a maximum of 10K tokens. RL optimizes the sampling efficiency to hit a ceiling the base model's unconstrained generation could not reliably reach.
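The copy mechanism described above can be sketched as a constrained-decoding filter. This is a minimal illustration under our own simplifying assumptions (toy word-level tokens, and a dictionary of candidate probabilities standing in for logits), not the RecaLLM implementation:

```python
def constrained_copy(context_tokens, prefix_tokens, candidates):
    """Keep only candidate next tokens that extend an exact span of the
    context beginning with prefix_tokens; all others are dropped so the
    model cannot hallucinate a value that is not literally in context."""
    n = len(prefix_tokens)
    allowed = set()
    # Find every occurrence of the prefix in the context and collect the
    # token that follows it; those are the only legal continuations.
    for i in range(len(context_tokens) - n):
        if context_tokens[i:i + n] == prefix_tokens:
            allowed.add(context_tokens[i + n])
    # Filter the candidate distribution, preserving relative preferences.
    return {tok: p for tok, p in candidates.items() if tok in allowed}
```

With context tokens `["key_7", "=", "42", ".", "key_9", "=", "13"]` and an injected prefix `["key_7", "="]`, only `"42"` survives the filter, even if the unconstrained model would have preferred a hallucinated value.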

We find corroborating evidence when extending this lens to conversational service agents and compositional reasoning. RLHF excels at optimizing the superficial style and tone of an agent, but it fails to rewrite the underlying logical boundaries. Shi et al. (2026) identify Empathy Resilience under high adversarial intensity: models maintain polite conversational facades and high conversational quality even as their logical compliance and procedural accuracy severely degrade. RL successfully optimizes the sampling for chat quality, but the fundamental logical boundary remains rigid.

This dynamic creates a deceptive Execution Gap. Models easily perform initial semantic classifications—accurately identifying user intents—but fail to deduce and execute correct subsequent actions due to intricate conditional dependencies (Shi et al., 2026). We observe an "Inverted-U" performance trajectory in multi-turn dialogues, where model performance peaks around Turn 5 but drops significantly by Turn 15 due to context fatigue and an inability to track long-context dependencies. The base model's reasoning limit acts as a hard boundary that RL cannot bypass.

Lee et al. (2026) expose this exact ceiling in domain-specific mathematical computation and multi-hop entity identification. In their DRBENCHER evaluations, frontier models successfully complete the initial semantic step, identifying hidden entities with 86.1% accuracy. Yet, their overall answer accuracy plummets to 20.1% because they fail at the subsequent procedural reasoning and computation steps. As the Compositional Complexity Index (CCI = Entities + Properties) increases, accuracy suffers a severe, monotonic decline, dropping from an average of 18.8% at CCI=2 to just 7.4% at CCI≥4 (Lee et al., 2026). The RL-driven agentic loops optimize the retrieval efficiency within the first step, but when forced to cross a fundamental capability boundary into mathematical composition, the agents fail catastrophically.
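The CCI-bucketed decline can be reproduced schematically. The per-task record shape below (`(entities, properties, is_correct)` tuples) is our own hypothetical schema for illustration, not the DRBENCHER data format:

```python
def compositional_complexity_index(n_entities, n_properties):
    """CCI as defined by Lee et al. (2026): entities plus properties."""
    return n_entities + n_properties

def accuracy_by_cci(results):
    """Bucket per-task correctness by CCI.

    results: iterable of (n_entities, n_properties, is_correct) tuples
    (hypothetical schema). Returns {cci: accuracy} sorted by CCI.
    """
    buckets = {}
    for ents, props, correct in results:
        cci = compositional_complexity_index(ents, props)
        hits, total = buckets.get(cci, (0, 0))
        buckets[cci] = (hits + int(correct), total + 1)
    return {cci: hits / total for cci, (hits, total) in sorted(buckets.items())}
```

Plotting such buckets against accuracy is what reveals the monotonic decline the authors report.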

While RL successfully optimizes sampling to overcome issues like "lost-in-thought" retrieval degradation and enforces stylistic consistency under pressure, this optimization eventually hits a hard wall when tasks shift from simple retrieval to deep procedural reasoning. RL cannot synthesize logic that the base model lacks. The current paradigm of RL-based post-training merely accelerates how quickly a model reaches its inherent limits.

The Base Model Ceiling: Execution Gaps and Compositional Limits

We observe that current state-of-the-art models hit a hard performance ceiling in deep procedural and compositional reasoning tasks, strongly supporting the idea of a base-model bottleneck. The prevailing industry assumption is that scaling reinforcement learning algorithms will eventually brute-force agents through these reasoning barriers. Our analysis suggests otherwise: RL post-training is merely optimizing a ceiling that only better base models can raise.

To understand why RL appears to boost reasoning without actually expanding capability boundaries, we look at the "lost-in-thought" phenomenon (Whitecross et al., 2026). As models generate long chain-of-thought reasoning, their ability to retrieve in-context information severely degrades because the context length increases with semantically related but distracting tokens. Whitecross et al. (2026) use an injection experiment to prove that the base model already possesses the latent capability to identify correct information, but fails to sample it correctly due to the distraction of its own reasoning tokens. When the researchers applied RL to teach the model to explicitly copy context into its reasoning trace, average performance improved from 65.5 to 71.3. Crucially, the RL did not expand the model's fundamental comprehension; it simply optimized a highly efficient sampling policy—teaching the model when to invoke a constrained-decoding mechanism. RL optimizes the extraction of existing knowledge, but it does not synthesize new fundamental reasoning capabilities.

When we move from static retrieval to dynamic procedural tasks, this base model bottleneck becomes glaringly obvious. Models exhibit a severe "Execution Gap": they successfully perform semantic classification to recognize user intent, but fail entirely at the underlying logical deductions required to execute complex tasks (Shi et al., 2026). Semantic understanding of a prompt does not guarantee procedural execution.

This failure compounds rapidly over time. In multi-turn interactions, models suffer from acute "context fatigue." We see an "Inverted-U" performance trajectory where logical compliance peaks early around Turn 5, but degrades significantly by Turn 15 due to an inability to track long-context conditional dependencies (Shi et al., 2026). Yet, the models remain exceptionally polite. Shi et al. (2026) term this "Empathy Resilience": under high adversarial intensity, agents maintain high conversational quality and polite facades even as their procedural accuracy collapses. This perfectly illustrates the limits of RLHF. RL successfully optimizes the superficial style and tone of the agent, but it fails entirely to expand the fundamental capability boundary required to solve the underlying logic under pressure.

If any doubt remains about the strict limits imposed by base models, we need only look at catastrophic failures under high compositional load. Evaluating frontier models on tasks requiring multi-hop entity identification and domain-specific computation reveals a stark reality. The strongest frontier models easily complete the first step—identifying hidden entities with 86.1% accuracy—but achieve a dismal 20.1% overall answer accuracy (Lee et al., 2026). Model performance suffers a severe, monotonic decline as the Compositional Complexity Index (CCI) increases. At a CCI of 2, average accuracy sits at 18.8%; at a CCI of 4 or higher, it plummets to 7.4% (Lee et al., 2026). The bottleneck is not initial identification, but the subsequent property retrieval and mathematical computation. RL-driven agentic loops are merely optimizing retrieval efficiency within existing boundaries. When forced to cross a fundamental capability boundary into high compositional complexity, the agents fail.

The evidence from static retrieval and procedural benchmarks paints a pessimistic picture that strongly validates our thesis. RL does not conjure reasoning capabilities out of thin air; it optimizes sampling up to the hard limit of the base model's pre-trained weights. However, when evaluating dynamic, multi-step agentic workflows, a divergent narrative emerges.

The Agentic Counter-Narrative: Capability Expansion in Multi-Step Environments

Our preceding analysis establishes a grim picture for post-training capability expansion. Models suffer from "lost-in-thought" retrieval degradation where reasoning actively harms context recall (Whitecross et al., 2026). They exhibit a severe "Execution Gap" where semantic understanding fails to translate into procedural execution (Shi et al., 2026). They face catastrophic failure at high Compositional Complexity Index values, successfully identifying entities but failing at subsequent mathematical reasoning (Lee et al., 2026). In these static or single-step environments, reinforcement learning primarily optimizes superficial traits. Shi et al. (2026) expose this directly through "Empathy Resilience," where models maintain polite conversational facades while fundamentally failing at the underlying logical tasks.

We must, however, challenge our own thesis. The assertion that RL merely improves sampling efficiency within a hard capability boundary is overly reductive when applied to dynamic, multi-step agentic tasks, where RL can genuinely expand capability boundaries.

The PASS@(k,T) analysis explicitly refutes the thesis for agentic workflows, proving that in multi-step tool-use tasks, RL synthesizes capabilities beyond what the base model can achieve even with infinite sampling (Zhai et al., 2026). The PASS@(k,T) framework mathematically disentangles sampling efficiency from fundamental capability. If the thesis held true, an unaligned base model given an infinite compute budget (T → ∞) would eventually stumble upon the correct trajectory. Zhai et al. (2026) demonstrate that it does not. For complex, nested tool-use sequences, the base model possesses a hard zero probability of success. RL does not just help the model sample a latent path faster; it constructs an entirely new multi-step reasoning pathway that did not exist in the base model's pre-trained distribution.
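The intuition behind the infinite-sampling argument can be made concrete with a Monte-Carlo sketch. This is our own reading of a PASS@(k,T)-style metric (k sampled trajectories, each requiring T consecutive correct steps, with i.i.d. per-step success for illustration), not Zhai et al.'s exact formalism:

```python
import random

def pass_at_k_T(p_step, k, T, trials=10_000, seed=0):
    """Estimate the probability that at least one of k independently
    sampled trajectories, each needing T consecutive correct steps,
    solves the task. p_step is an illustrative i.i.d. per-step success
    probability; a real agent's steps are of course not independent."""
    rng = random.Random(seed)
    solved = 0
    for _ in range(trials):
        # A trajectory succeeds only if all T steps succeed; the task is
        # solved if any of the k trajectories succeeds.
        solved += any(
            all(rng.random() < p_step for _ in range(T))
            for _ in range(k)
        )
    return solved / trials
```

The key property: if the base policy assigns zero probability to a required step (`p_step = 0`), the estimate stays at zero for any k, so no sampling budget helps. RL that makes that step possible at all moves the capability boundary, not just the sampling rate.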

Furthermore, the assumption that only better base models can raise the performance ceiling is challenged by specialized system design. The thesis implicitly relies on the scaling laws of pre-training—if RL cannot expand the boundary, then only a larger, more capable base model can. Targeted tuning dismantles this assumption. A highly specialized 4B parameter model can outperform a 397B general instruct model in trajectory safety evaluation (Yang et al., 2026). This capability inversion proves that system-level framing and task-specific RL tuning allow profoundly smaller models to transcend the presumed capabilities of models nearly two orders of magnitude larger. Scale is not the sole mechanism for capability expansion.

In compositional workflows, RL transcends stylistic alignment and becomes a fundamental driver of novel multi-step reasoning pathways. Benchmarks like DRBENCHER show that frontier models currently struggle to bridge the gap between entity identification (86.1% accuracy) and the downstream mathematical computation (20.1% overall answer accuracy) (Lee et al., 2026). However, the counter-evidence suggests this failure lies in how we apply RL to generalist models, not an absolute limit of the RL paradigm itself. When deployed effectively in dynamic environments, RL forces the model to learn the conditional dependencies required to bridge these gaps, fundamentally rewriting the model's procedural execution logic rather than simply making it more polite.

While Zhai et al. (2026) and Yang et al. (2026) provide compelling counter-evidence to the thesis, their findings are currently constrained by specific methodological limitations that must be addressed to fully resolve the debate. Zhai et al. (2026) limit their rigorous PASS@(k,T) capability-expansion framework to a single 7B model in relatively simple retrieval environments, leaving it unproven whether RL synthesizes new capabilities in frontier-class models exceeding 100B parameters. Similarly, Yang et al. (2026) demonstrate their scale-inversion within a highly specific niche. To definitively determine whether RL breaks the capability ceiling or merely optimizes our approach toward it at scale, we must test these boundaries in open-ended, web-scale, and highly nested multi-agent environments.

Methodological Gaps and the Frontier Scaling Question

We confront the primary counter-evidence to our thesis directly. Zhai et al. (2026) argue that reinforcement learning genuinely expands capability boundaries in multi-step tool-use tasks. By deploying the PASS@(k,T) mathematical framework, they attempt to disentangle sampling efficiency from fundamental capability, concluding that RL pushes agentic performance beyond what a base model can achieve even with infinite sampling (Zhai et al., 2026). We reject the broad applicability of this conclusion. The counter-evidence for capability expansion presented by Zhai et al. is currently limited to smaller models—specifically 7B parameter architectures—operating in simple retrieval environments. It remains entirely unproven whether PASS@(k,T) capability expansion holds for 100B+ frontier models. At the current frontier of scale, the latent capabilities of the base model are vastly larger, meaning RL likely ceases to expand fundamental boundaries and reverts entirely to optimizing sampling efficiency within them.

Our visibility into this frontier scaling question is further obstructed by the limitations of our evaluation paradigms. Current benchmarks rely too heavily on bounded, synthetic data or flat graph structures. Shi et al. (2026) make critical progress here with the SAGE benchmark, formalizing unstructured Standard Operating Procedures into dialogue graphs. They expose a strict "Execution Gap," demonstrating that while models successfully classify semantic intents, they fail the underlying logical deductions required to execute complex conditional procedures (Shi et al., 2026). Yet, as Shi et al. acknowledge, these evaluations still lack the nested, open-ended complexity of web-scale multi-agent environments.

This leads to a critical blind spot: the absence of rigorous deep research evaluation data—frequently cited as the DR^3-Eval gap—obscures how these RL dynamics apply to long-horizon, highly autonomous agentic tasks. We already observe systemic failures at short horizons. Lee et al. (2026) demonstrate a severe, monotonic decline in agent accuracy as the Compositional Complexity Index (CCI) increases. Evaluating frontier models on tasks requiring multi-hop entity identification and domain-specific math, they found the strongest model achieved only 20.1% overall accuracy, despite successfully identifying hidden entities 86.1% of the time (Lee et al., 2026). When agents fail catastrophically at a CCI of 4, our ability to measure RL's impact on days-long, unbounded deep research workflows is severely compromised. Property retrieval and mathematical computation act as hard bottlenecks, proving that RL-driven agentic loops are merely optimizing retrieval efficiency up to a rigid base-model boundary.

The mechanics of this boundary are laid bare by Whitecross et al. (2026). They identify the "lost-in-thought" phenomenon, where long chain-of-thought reasoning actively degrades an LLM's ability to retrieve in-context information. Crucially, their injection experiments prove the base model already possesses the fundamental capability to identify the correct information; it simply fails to sample it correctly due to the distraction of its own reasoning tokens. By applying RL to teach the model when to trigger a constrained-decoding recall mechanism, they improved average performance from 65.5 to 71.3 (Whitecross et al., 2026). RL is not expanding the model's fundamental comprehension of long contexts. It is optimizing a highly efficient sampling policy to hit a performance ceiling that the base model's unconstrained generation could not reliably reach.

We see this exact dynamic mirrored in conversational agent tasks. Shi et al. (2026) observe "Empathy Resilience" under high adversarial intensity: models maintain high conversational quality and polite facades even as their logical compliance severely degrades. RLHF successfully optimizes the superficial style and tone of the agent, but it fails to expand the fundamental capability boundary required to solve the underlying procedural logic under pressure.

Addressing these methodological gaps—specifically by scaling PASS@(k,T) evaluations to frontier models and establishing open-ended deep research benchmarks—is essential for the field. However, the existing synthesis of literature already provides a nuanced resolution to our original thesis. Reinforcement learning patches over sampling failures and optimizes stylistic compliance, but it does not generate net-new reasoning capabilities. The widespread investment in RL-based post-training for agentic tasks is optimizing a ceiling that only better base models can raise.

References

  1. Whitecross et al. (2026) · RecaLLM: Addressing the Lost-in-Thought Phenomenon with Explicit In-Context Retrieval · arXiv:2604.09494
    Kyle Whitecross, Negin Rahimi
  2. Shi et al. (2026) · SAGE: A Service Agent Graph-guided Evaluation Benchmark · arXiv:2604.09285
    Ling Shi, Yuqin Dai, Ziyin Wang, Ning Gao, Wei Zhang, Chaozheng Wang, et al.
  3. Lee et al. (2026) · DRBENCHER: Can Your Agent Identify the Entity, Retrieve Its Properties and Do the Math? · arXiv:2604.09251
    Young-Suk Lee, Ramon Fernandez Astudillo, Radu Florian
  4. Zhai et al. (2026) · Does RL Expand the Capability Boundary of LLM Agents? A PASS@(k,T) Analysis · arXiv:2604.14877
    Zhiyuan Zhai, Wenjing Yan, Xiaodan Shao, Xin Wang
  5. Yang et al. (2026) · Benchmarks for Trajectory Safety Evaluation and Diagnosis in OpenClaw and Codex: ATBench-Claw and ATBench-CodeX · arXiv:2604.14858
    Zhonghao Yang, Yu Li, Yanxu Zhu, Tianyi Zhou, Yuejin Xie, Haoyu Luo, et al.