Research

7 Jun 2026 · 4 papers analyzed

Beyond the Weights: How Scaffolding and Runtime Orchestration Outperform Model Scaling in Agentic Systems

The AI industry's prevailing focus on model-level improvements—such as parameter scaling and RLHF—often overlooks the massive performance multipliers available at the system level. This article synthesizes recent empirical evidence demonstrating that optimizing the scaffolding, memory, and runtime orchestration surrounding an LLM contributes more significantly to task success than raw compute or model generation volume. However, while harness optimization yields outsized returns, we reveal a critical dependency: underlying model capabilities and training priors remain essential to prevent orchestration failures like 'parsing cascades.' Ultimately, we argue for a paradigm shift from isolated model scaling to the co-design of models and their surrounding agentic architectures.

Introduction: The Misallocation of Optimization Efforts in Agentic AI

The AI industry currently operates under a deeply entrenched, model-centric paradigm. We invest heavily in parameter scaling, fine-tuning, and reinforcement learning from human feedback (RLHF) to drive performance improvements. The prevailing assumption across research labs and product teams is that the underlying weights are the primary locus of agentic capability. Consequently, pushing the frontier of autonomous systems is treated as a problem of exponentially larger training runs and massive datasets.

Emerging evidence suggests that this focus is heavily misallocated. The scaffolding and runtime harness surrounding a large language model act as a far more significant bottleneck, and a far more powerful multiplier, for agent task performance than the raw capabilities of the model itself. Raw compute metrics are remarkably poor predictors of agent success. In controlled scaling experiments, raw tokens and tool calls explain only limited variation in success rates, yielding R² values of just 0.33 and 0.42 (Zhang et al., 2026). By contrast, harness efficiency, measured by how effectively the system converts raw compute into informative, retained feedback, explains success variation with an R² of 0.97, while raw cost explains almost none (R² = 0.01) (Zhang et al., 2026). When compute budgets are held identical, simply improving the quality of harness feedback increases task success from 0.27 to 0.90 (Zhang et al., 2026).

We see identical dynamics when agents optimize their own operational environments rather than their weights. By implementing self-supervised methods where agents optimize their surrounding harness—specifically their tools, skills, and instructions—we achieve a 19% absolute performance gain on SWE-Bench Pro against established baselines (Pan et al., 2026). This massive jump in executing complex software engineering tasks requires zero model weight updates or ground-truth validation data. The bottleneck is not the model; it is the system layer.

This article argues that optimizing the interface, memory, and tool orchestration layers yields large, immediate performance gains that dwarf the marginal returns of further base-model scaling. At the memory layer, introducing programmable, page-centric tensor abstractions allows for dynamic sparse attention serving that achieves up to 3.46x higher throughput and reduces user-side latency by up to 12.8x without sacrificing reasoning accuracy (Chen et al., 2026). The orchestration layer is potent enough that AI agents equipped with these frameworks can autonomously generate and refine their own sparse attention algorithms, indicating that better scaffolding directly accelerates system-level performance independent of the base model (Chen et al., 2026).

However, we must squarely address the boundary conditions of this thesis: these advanced harnesses fundamentally rely on specific underlying model capabilities to function reliably. Interface optimization provides a clear demonstration of this dependency. Shifting data serialization from JSON to token-optimized formats (like TOON or TRON) reduces total token consumption by up to 27% (Kutschka et al., 2026). Yet, if an orchestration framework forces these optimized formats onto a model lacking the requisite inference-time reasoning capabilities, accuracy collapses. In multi-turn settings, this mismatch triggers a "parsing cascade": format generation failures prompt automated retries, which compound and paradoxically increase total token usage by up to 11% compared to a standard JSON baseline (Kutschka et al., 2026). The model weights are not the primary engine of task success, but deep model priors (like JSON training) and inference-time reasoning act as the load-bearing variables required to prevent catastrophic failures in the orchestration layer.

To understand why the system layer offers such outsized returns, we must first examine how harness optimization directly impacts task success compared to raw compute scaling.

Predictive Power of Compute Metrics vs. Normalized EFC on Agent Failure Rates (R²)

Compares how well raw compute metrics (tokens, tool calls) versus task-demand-normalized Effective Feedback Compute (Oracle-EFC/Dtask) predict agent failure rates in controlled scaling experiments. A higher R² indicates stronger predictive power. This directly supports the article's argument that raw compute is a poor proxy for agentic performance.

The Empirical Dominance of Harness Optimization over Raw Compute

The prevailing consensus in artificial intelligence development assumes that scaling model parameters, expanding pre-training data, and refining RLHF pipelines are the primary drivers of agentic capability. Our analysis suggests this focus is fundamentally misallocated. We observe that the scaffolding and runtime harness surrounding an LLM govern task performance far more heavily than the underlying model itself. The highest returns on investment now lie in optimizing the interface, memory, and tool orchestration layers.

We see the starkest evidence of this in static harness optimization. By refining the tools, skills, and instructions provided to a fixed model, system performance scales dramatically without a single weight update. Pan et al. (2026) report a 19% absolute gain on SWE-Bench Pro using a frozen underlying model paired with an optimized harness. A 19% absolute improvement on a benchmark of this complexity typically requires a generational leap in base model scaling. Achieving it entirely through scaffolding proves that current models are primarily bottlenecked by their environments, not their internal representations.

Consequently, raw compute and token volume serve as remarkably poor predictors of agent success. The industry relies heavily on test-time compute scaling (throwing more tokens and reasoning steps at a problem), but empirical data reveals the stark limits of this brute-force approach. Zhang et al. (2026) demonstrate that raw tokens and tool calls explain only limited variation in agent success during controlled scaling, with a coefficient of determination as low as R² = 0.33. Pumping massive token volumes through a poorly orchestrated harness generates waste, not capability.

The true driver of performance is what Zhang et al. (2026) define as Effective Feedback Compute (EFC). An agent's success is dictated by how efficiently its harness converts raw computation into informative, valid, and retained feedback. When assessing task-demand-normalized EFC, the predictive fit for failure rates jumps to R² = 0.99. Even more definitively, Zhang et al. (2026) show that in matched-budget interventions (where raw cost and total tool calls are held strictly constant), improving the quality of harness feedback increases task success from 0.27 to 0.90. Ablating specific harness modules like routers, verifiers, and memory shows that harness efficiency explains success variation with R² = 0.97, whereas raw cost explains almost none (R² = 0.01). Orchestration efficiency outright trumps raw compute.
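To make the R² comparison concrete, the following minimal sketch computes the coefficient of determination for a single-predictor least-squares fit. The data are synthetic toy numbers chosen for illustration, not values from Zhang et al. (2026), and `feedback_units` is a hypothetical stand-in for an effective-feedback measure rather than the paper's EFC definition.

```python
def r_squared(xs, ys):
    """R² of an ordinary least-squares fit of ys on a single predictor xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual and total sums of squares for the fitted line.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Synthetic runs (assumption): failure rate tracks feedback, not token volume.
raw_tokens = [10, 20, 30, 40, 50, 60]
feedback_units = [1, 5, 2, 6, 3, 7]
failure_rate = [0.9, 0.5, 0.8, 0.4, 0.7, 0.3]

r2_tokens = r_squared(raw_tokens, failure_rate)
r2_feedback = r_squared(feedback_units, failure_rate)
```

In this toy setup the feedback measure fits the failure rate far better than raw token volume does, which is the qualitative pattern the reported R² values describe.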

We must clarify that base model capabilities are not entirely irrelevant, but their role must be reframed. The model acts as the substrate that executes the harness, and its training priors dictate how efficiently it can interface with the orchestration layer. Kutschka et al. (2026) illustrate this interface dependency when evaluating token-optimized data formats (TOON, TRON) against standard JSON. While alternative formats reduce total tokens by up to 27%, forcing models out of their deep JSON training priors degrades accuracy and triggers fatal "parsing cascades" in multi-turn settings. In these cascades, format generation failures cause automated retries that ultimately increase total token consumption. The underlying model's priors and inference-time reasoning capabilities are the load-bearing variables required to make an optimized harness function natively without catastrophic error loops.

Beyond tool routing and data formats, optimizing the memory layer yields identical outsized returns. Chen et al. (2026) demonstrate this with Vortex, a programmable sparse attention serving system. By optimizing the KV cache and dynamic sparse attention algorithms at the scaffolding level, they achieved up to 3.46x higher throughput on AIME24 and reduced P95 user-side latency by up to 12.8x, all while preserving reasoning accuracy. These massive efficiency gains required zero alterations to the base model weights, further proving that the systems infrastructure surrounding the model is the true frontier for performance gains.
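As a rough illustration of what page-level sparse attention over a KV cache can look like, the sketch below scores each cache page by the query's dot product with the page's mean key, keeps the top-k pages, and runs softmax attention only over those. This is a generic technique written in plain Python under our own assumptions; it is not the Vortex API, and the names `select_pages` and `sparse_attention` are hypothetical.

```python
import math


def select_pages(query, key_pages, top_k):
    """Return indices of the top_k pages ranked by query · mean(page keys)."""
    dim = len(query)
    scored = []
    for idx, page in enumerate(key_pages):
        mean_key = [sum(key[d] for key in page) / len(page) for d in range(dim)]
        scored.append((sum(q * m for q, m in zip(query, mean_key)), idx))
    chosen = sorted(scored, reverse=True)[:top_k]
    return sorted(idx for _, idx in chosen)


def sparse_attention(query, key_pages, value_pages, top_k):
    """Softmax attention restricted to the keys and values of selected pages."""
    pages = select_pages(query, key_pages, top_k)
    keys = [k for p in pages for k in key_pages[p]]
    values = [v for p in pages for v in value_pages[p]]
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * kd for q, kd in zip(query, k)) for k in keys]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    return [
        sum(w * v[d] for w, v in zip(weights, values)) / total
        for d in range(len(values[0]))
    ]


# Toy cache with three pages of two tokens each (illustrative values).
query = [1.0, 0.0]
key_pages = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[-1.0, 0.0], [-0.8, 0.2]],
    [[0.5, 0.5], [0.6, 0.4]],
]
value_pages = [[[1.0], [1.0]], [[-5.0], [-5.0]], [[2.0], [2.0]]]
selected = select_pages(query, key_pages, top_k=2)
output = sparse_attention(query, key_pages, value_pages, top_k=2)
```

The point of the sketch is that the selection policy lives entirely outside the model weights: swapping the scoring rule changes which pages are read without touching the model.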

While these static harness improvements are profound, the efficiency of these systems is further amplified when agents are given the ability to optimize their own memory and interface layers.

Harness Optimization (RHO) vs. Memory/Skill Baselines on SWE-Bench Pro (Pass Rate %)

Shows the pass rate on SWE-Bench Pro achieved by Retrospective Harness Optimization (RHO) compared to feedback-free scaffolding baselines (Dynamic Cheatsheet, ReasoningBank, Sleep-time Compute) and the unoptimized baseline, all using the same fixed underlying model. Illustrates the outsized gains available from runtime orchestration improvements without any model-level changes.

Interface, Memory, and the Emergence of Self-Optimizing Scaffolding

We argue that the industry's obsessive focus on model-level improvements—massive parameter scaling, exhaustive fine-tuning, and RLHF—represents a fundamental misallocation of resources. The evidence dictates that the surrounding scaffolding and runtime harness contribute far more to agent task performance than the underlying base model. The model is merely the engine; the harness is the transmission. Right now, the field is redlining the engine while ignoring a highly inefficient transmission.

Zhang et al. (2026) quantify this. In their controlled scaling experiments for agent harnesses, raw compute metrics and token volumes are remarkably poor predictors of agent success. Raw tokens and tool calls explained only a fraction of performance variation, whereas task-demand-normalized Effective Feedback Compute (EFC) predicted failure rates with an R² of 0.99. When Zhang et al. applied matched-budget interventions, keeping raw cost and tool calls identical, simply improving the quality of the harness feedback loops increased task success from 0.27 to 0.90. Ablating harness modules revealed that harness efficiency explains success variation with an R² of 0.97, while raw cost explains almost none (R² = 0.01) (Zhang et al., 2026).

When we optimize the memory and interface layers directly, the performance multipliers are staggering. Memory and interface optimization drives critical system efficiency, bypassing the need for heavier model weights. Chen et al. (2026) demonstrate this forcefully with Vortex, a programmable sparse attention serving system. By manipulating the memory layer to optimize the KV cache, they achieved up to 3.46x higher throughput and reduced P95 user-side latency by up to 12.8x. Crucially, they secured these massive latency reductions and throughput gains without sacrificing reasoning accuracy and without altering a single model weight (Chen et al., 2026).

Furthermore, we are crossing a threshold where agents no longer rely on static, human-designed scaffolding. Agents are increasingly capable of autonomous self-optimization of their own harnesses. The system architecture itself is becoming a dynamic variable controlled by the agent at runtime. Both Chen et al. (2026) and Pan et al. (2026) demonstrate self-improving systems that operate entirely outside of model weight updates.

Using the Vortex framework, agents autonomously generated, tested, and refined their own sparse attention algorithms, effectively rewriting their own memory management protocols to achieve the aforementioned 3.46x throughput speedup on AIME24 (Chen et al., 2026). Pan et al. (2026) push this automated optimization into the orchestration layer. Through Retrospective Harness Optimization (RHO), agents use unlabeled past trajectories and their own self-preference to rewrite their tools, skills, and system instructions. This self-supervised harness optimization yielded a massive 19% absolute performance gain on the SWE-Bench Pro benchmark (Pan et al., 2026). The fastest path to advanced agentic capabilities is building systems that allow the agent to continuously refactor its own operating environment.
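The general shape of preference-based harness selection can be sketched as follows. This is a loose illustration of choosing between harness variants by pairwise preference over rollouts, not the RHO algorithm as published; `rollout`, `prefers`, and the dictionary-based harness representation are all hypothetical placeholders, and the toy judge stands in for the agent's own self-preference.

```python
def select_harness(current, candidates, tasks, rollout, prefers):
    """Adopt a candidate harness when its rollouts win a majority of
    pairwise comparisons against the incumbent on unlabeled tasks."""
    best = current
    for candidate in candidates:
        wins = sum(
            1
            for task in tasks
            if prefers(rollout(candidate, task), rollout(best, task))
        )
        if wins * 2 > len(tasks):
            best = candidate
    return best


def toy_rollout(harness, task):
    # Placeholder for running the agent under a harness on a past task.
    return f"{task}: {harness['instructions']}"


def toy_prefers(trajectory_a, trajectory_b):
    # Placeholder judge: favors trajectories that include verification.
    return trajectory_a.count("verify") > trajectory_b.count("verify")


baseline = {"instructions": "edit the file"}
candidates = [
    {"instructions": "edit the file, then verify with tests"},
    {"instructions": "guess"},
]
chosen = select_harness(
    baseline, candidates, ["bug-1", "bug-2", "bug-3"], toy_rollout, toy_prefers
)
```

No ground-truth labels appear anywhere in the loop; the only signal is the comparison between trajectories, which is what makes this kind of optimization self-supervised.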

However, the aggressive pursuit of interface efficiency and automated orchestration exposes a critical vulnerability, challenging the thesis that the underlying model is merely a commoditized, interchangeable component.

When we push the interface layer to its absolute limits to save compute, the system breaks down unless the underlying model possesses specific, deeply ingrained priors. Kutschka et al. (2026) expose this reality by evaluating token-optimized data formats (TOON, TRON) against standard JSON. While alternative formats reduce token consumption by up to 27%, they trigger a catastrophic "parsing cascade" in multi-turn settings. Because models lack the structural priors for these highly compressed formats, formatting errors compound. The automated retry mechanisms—the very orchestration loops championed by Zhang et al. and Pan et al.—backfire. The parsing failures force the system to burn through additional reasoning iterations, ultimately causing the models to consume up to 11% more total tokens than the baseline JSON implementations (Kutschka et al., 2026). The underlying model's training priors and inference-time reasoning capabilities remain the exact load-bearing variables required to make optimized scaffolding function.

Token Compression Trade-offs: Efficiency Gains vs. Parsing Cascade Risk by Model Capability

Summarizes the dual finding from the notation benchmark study — that aggressive token compression (TRON/TOON) reduces tokens in single-turn settings but triggers parsing cascades in multi-turn settings for weaker models, ultimately increasing token consumption. This supports the article's critical dependency argument: scaffolding gains are bounded by underlying model capabilities.

| Format Strategy | Token Reduction (Single-Turn) | Accuracy Cost vs. JSON Baseline | Multi-Turn Effect (Weaker Models) | Net Token Outcome (Multi-Turn) |
| --- | --- | --- | --- | --- |
| JSON Baseline | 0% | 0 pp | Stable | Baseline |
| TOON (Moderate Compression) | Up to 18% | ~9 pp drop | Moderate cascade risk | Near baseline or slight increase |
| TRON (Full Compression) | Up to 27% | ~14 pp drop | Parsing cascade triggered (e.g., Qwen3-32B) | +8% to +11% more tokens than JSON baseline |

The Orchestration Paradox: Retries, Cascades, and Model Dependencies

We argue that the scaffolding and runtime harness surrounding a Large Language Model (LLM) contributes more to agent task performance than the underlying model itself. The industry's obsessive focus on model-level improvements—such as raw parameter scaling, fine-tuning, and RLHF—is misallocated relative to the massive returns available from optimizing the interface, memory, and tool orchestration layers.

The empirical evidence for harness-driven performance is overwhelming. Zhang et al. (2026) demonstrate that raw compute metrics and token volumes are remarkably poor predictors of agent success, explaining only a fraction of performance variance (R² = 0.33). However, when we measure the efficiency of the orchestration layer using Task-Demand-Normalized Effective Feedback Compute (EFC), predictive fit jumps to R² = 0.99. In highly controlled, matched-budget interventions, Zhang et al. (2026) show that simply improving the quality of harness feedback increases task success from 0.27 to 0.90, even while raw cost and tool calls remain identical. Ablating harness modules entirely confirms this: harness efficiency explains 97% of success variation, whereas raw cost explains almost none. Pan et al. (2026) corroborate this system-centric view, proving that agents can autonomously optimize their own tools and instructions via self-preference over trajectory rollouts, yielding massive performance gains without a single model weight update.

Yet, as we analyze the mechanics of these orchestration layers, we observe a fundamental disagreement in the literature regarding the reliability of runtime orchestration. While Zhang et al. (2026) and Pan et al. (2026) view feedback loops, iterative verification, and automated retries as the primary engines of task success, Kutschka et al. (2026) highlight severe failure modes inherent to these exact mechanisms.

The friction occurs at the intersection of interface optimization and error handling. When system designers attempt to optimize the interface layer by substituting verbose JSON with token-optimized data formats like TOON or TRON, they achieve theoretical token reductions of up to 27% (Kutschka et al., 2026). However, in multi-turn agentic loops, these optimizations routinely backfire. Automated retries trigger what Kutschka et al. (2026) identify as "parsing cascades." Because the model struggles to natively output alternative formats without degrading structural cues, parallel tool-call accuracy collapses to near-zero. The generation failures instantly trigger the orchestration layer's error-handling loops. Instead of saving compute, the resulting multi-turn reasoning iterations paradoxically increase total token usage. For instance, testing on Qwen3-32B revealed that applying full compression formats ultimately consumed 8% to 11% more tokens than the unoptimized JSON baseline due to these cascading retry loops (Kutschka et al., 2026).
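A back-of-the-envelope model shows how retries can erase a per-call saving. The sketch below assumes, purely for illustration, that every attempt costs the same number of tokens, that attempts fail independently with a fixed probability, and that the JSON baseline never retries; none of these assumptions come from Kutschka et al. (2026), and real cascades are likely worse because retries carry accumulated context.

```python
def expected_attempts(failure_rate, max_retries):
    """Expected generation attempts when attempt k+1 only happens if the
    first k attempts all failed (probability failure_rate ** k)."""
    return sum(failure_rate ** k for k in range(max_retries + 1))


def relative_token_cost(token_saving, failure_rate, max_retries):
    """Expected tokens relative to a JSON baseline normalized to 1.0."""
    return (1.0 - token_saving) * expected_attempts(failure_rate, max_retries)


# A 27% per-attempt saving with no failures versus a 50% failure rate.
clean_cost = relative_token_cost(0.27, 0.0, max_retries=3)
cascade_cost = relative_token_cost(0.27, 0.5, max_retries=3)
```

Under these assumptions, with unlimited retries the expected cost is (1 - s) / (1 - p), so the saving s is cancelled exactly when the failure probability p equals s: a 27% token saving is wiped out once roughly one in four compact generations fails.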

This dynamic forces us to refine our initial thesis. Model-level improvements are not entirely misallocated. Kutschka et al. (2026) prove that underlying model priors—specifically deep training-time exposure to JSON—and inference-time reasoning capabilities are load-bearing variables required to make optimized scaffolding function without catastrophic accuracy loss. When forced into non-JSON formats, standard generation fails. Only models executing explicit inference-time reasoning (e.g., Qwen3-32B with "thinking" mode enabled) possess the robustness necessary to match baseline accuracy under strict input-only compression (Kutschka et al., 2026).

We therefore face an orchestration paradox. The harness provides the highest leverage for system efficiency and task success, but the harness is fundamentally brittle. It relies on the model's native structural priors to prevent its own feedback loops from spiraling into runaway compute cascades.
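One way a harness might contain this brittleness is to cap retries on the compact format and fall back to JSON. The sketch below is a minimal illustration under stated assumptions: `generate` is a hypothetical model call that takes a format name, `parse_compact` is a hypothetical parser that raises `ValueError` on malformed output, and the fallback policy is our own example rather than a mechanism evaluated in the cited papers.

```python
import json


def call_with_fallback(generate, parse_compact, max_compact_retries=1):
    """Try the compact format a bounded number of times, then use JSON.

    Returns (parsed_payload, attempts_used).
    """
    attempts = 0
    for _ in range(max_compact_retries + 1):
        attempts += 1
        try:
            return parse_compact(generate("compact")), attempts
        except ValueError:
            continue  # malformed compact output: retry within the cap
    attempts += 1
    # Fallback leans on the model's JSON prior; errors here propagate.
    return json.loads(generate("json")), attempts


def fake_generate(fmt):
    # Stand-in model that only produces well-formed output in JSON.
    return '{"tool": "search"}' if fmt == "json" else "tool search"


def strict_compact_parser(text):
    # Toy "key=value" parser used only for this demonstration.
    if "=" not in text:
        raise ValueError("malformed compact output")
    key, value = text.split("=", 1)
    return {key: value}


payload, attempts = call_with_fallback(
    fake_generate, strict_compact_parser, max_compact_retries=1
)
```

The cap turns an open-ended cascade into a fixed worst-case cost: at most `max_compact_retries + 1` compact attempts plus one JSON call.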

This dependency of the harness on the model's native capabilities reveals critical gaps in the current literature regarding the limits and real-world applicability of system-centric scaling.

Unresolved Gaps: Financial ROI, Physical Environments, and Scaling Limits

While the empirical evidence overwhelmingly supports the primacy of the harness layer in driving agent performance, we must confront three critical unresolved gaps in the current literature. These gaps complicate the immediate wholesale abandonment of model-level scaling, demanding a more nuanced evaluation of system architecture.

First, the literature currently lacks strict end-to-end financial ROI comparisons between the development and maintenance of complex harnesses and the capital expenditure of model fine-tuning or RLHF. We have definitive proof that scaffolding optimization yields massive, compute-independent performance multipliers. Zhang et al. (2026) demonstrate that improving feedback quality within a matched compute budget catapults task success from 0.27 to 0.90. Similarly, optimizing the memory layer via the Vortex framework yields up to 3.46x higher throughput and a 12.8x reduction in latency without altering underlying model weights (Chen et al., 2026).

However, these papers measure execution compute, not human engineering cost. Developing programmable sparse attention serving systems (Chen et al., 2026) or meticulously designing task-demand-normalized routing architectures (Zhang et al., 2026) requires highly specialized software engineering. We lack frameworks to quantify whether the operational expenditure (OpEx) of continuously maintaining and updating these intricate orchestration layers is fundamentally cheaper than the brute-force capital expenditure (CapEx) of training a larger, more capable base model. Until we establish a unified cost model that includes human engineering hours, the financial superiority of harness-centric development remains an assumption rather than a proven economic reality.

Second, highly orchestrated, retry-heavy harnesses rely heavily on reversible environments. Modern scaffolding fundamentally operates on trial, error, and reflection. Pan et al. (2026) achieve a 19% absolute performance gain on SWE-Bench Pro via Retrospective Harness Optimization (RHO), a method that relies entirely on self-preference over trajectory rollouts. This assumes the agent can safely "roll out" a trajectory, fail, and try again.

When we move outside of sandboxed software environments, this paradigm breaks down. The efficacy of these harnesses in zero-shot, irreversible, or physical (robotics) environments remains unproven and potentially dangerous. Furthermore, Kutschka et al. (2026) expose a critical failure mode of orchestration loops: automated retries frequently trigger "parsing cascades." When models fail to output deeply compressed, token-optimized formats, the harness forces additional reasoning iterations. In a multi-turn setting, this error-handling loop wipes out efficiency gains, causing models to consume up to 11% more tokens than a standard JSON baseline (Kutschka et al., 2026). In software, a parsing cascade is an inefficiency; in a physical or high-stakes financial environment, an unconstrained retry loop executing irreversible actions is a catastrophic liability.

Third, it is currently unknown if harness optimization eventually hits a hard long-term scaling ceiling where underlying model scaling (parameter count) becomes strictly necessary to progress further. Our thesis posits that the industry's focus on model-level improvements is misallocated, but we must acknowledge that model capabilities are the foundational prerequisites for harness success. Kutschka et al. (2026) prove that alternative data serialization formats fail catastrophically without deep model priors (like JSON training exposure) or advanced inference-time reasoning capabilities. You cannot scaffold a fundamentally weak model into solving frontier tasks indefinitely. The harness acts as a force multiplier, but a multiplier applied to zero remains zero. We lack longitudinal studies identifying the exact threshold where scaffolding optimization yields diminishing returns, forcing a return to base model scaling.

Synthesizing these empirical findings and identified gaps allows us to re-evaluate our original thesis and propose a new trajectory for agentic AI development. The data confirms that the runtime harness, interface, and memory layers provide the highest immediate leverage for system performance. Yet, the vulnerabilities of orchestration—specifically parsing cascades and reliance on reversible environments—prove that we cannot treat the model and the harness as entirely isolated stacks. The next frontier of agentic AI is not choosing between harness optimization and model scaling. Instead, it is the deliberate co-design of base models explicitly trained to interface with advanced scaffolding, optimizing for single-shot reliability and deterministic tool invocation to bypass the physical and financial limits of infinite retry loops.

Conclusion: Toward Model-Harness Co-Design

The initial thesis is heavily supported by the literature: the industry's focus on model-level improvements is disproportionate given the large, immediate, and compute-efficient returns available from harness optimization. Raw compute and token volume are fundamentally poor predictors of agent success. In controlled scaling experiments, raw cost explains almost none of the variation in task performance (R² = 0.01), whereas harness efficiency (specifically how well the orchestration layer converts compute into retained feedback) explains success with an R² of 0.97 (Zhang et al., 2026). When we hold the underlying model and compute budget constant, simply improving the quality of harness feedback drives task success from 0.27 to 0.90 (Zhang et al., 2026).

This phenomenon extends directly to the tool and memory layers. Optimizing an agent's surrounding tools and instructions via retrospective self-preference yields a 19% absolute performance gain on SWE-Bench Pro, completely bypassing the need for model weight updates (Pan et al., 2026). At the memory layer, programmable sparse attention serving frameworks like Vortex achieve up to 3.46x higher throughput and reduce P95 latency by nearly 13x without sacrificing reasoning accuracy (Chen et al., 2026). The evidence is unequivocal: optimizing the scaffolding surrounding the LLM provides an immediate multiplier on system capability that isolated model scaling cannot match.

However, the claim that model-level focus is entirely "misallocated" is overly reductive. The evidence regarding harness independence is mixed, demonstrating that the underlying model cannot be treated as a black box. Its native capabilities, specifically its reasoning depth and formatting priors, are strict prerequisites for harness stability.

We see this tension clearly at the interface layer. Token-optimized formats like TOON and TRON theoretically reduce token consumption by up to 27% compared to JSON, but forcing models to use them degrades task accuracy and breaks parallel tool-call generation (Kutschka et al., 2026). Alternative formats fight the model's deep training priors. When the model fails to generate these optimized formats correctly, the orchestration layer's automated retries trigger "parsing cascades"—cascading failures where error correction loops consume up to 11% more total tokens than the baseline JSON implementation (Kutschka et al., 2026). The runtime orchestration actively backfires when the base model lacks the native capability to support it. Inference-time reasoning ("thinking" mode) is the dominant variable required to maintain accuracy under input compression (Kutschka et al., 2026). The harness is only as stable as the model's structural priors allow it to be.

The future of AI development must shift from isolated model scaling to model-harness co-design. We can no longer afford to train models in a vacuum and bolt orchestration layers onto them post hoc. Models must be specifically trained and fine-tuned to support, rather than fight, the advanced scaffolding that surrounds them.

If JSON's structural overhead is a bottleneck (Kutschka et al., 2026), we must pre-train models natively on highly compressed, token-optimized formats to eliminate parsing cascades at the interface layer. If harness feedback quality dictates task success (Zhang et al., 2026), post-training paradigms must optimize directly for feedback retention and tool-call routing rather than conversational compliance.

We already see the early stages of this co-design emerging autonomously. Agents equipped with self-supervised orchestration frameworks are successfully optimizing their own tools and skills (Pan et al., 2026) and writing dynamic sparse attention algorithms to optimize their own memory serving (Chen et al., 2026). The next generation of frontier systems will not be defined by parameter count alone, but by how seamlessly the model weights interface with memory retrieval, tool orchestration, and feedback loops. Maximizing agentic AI requires treating the model and its harness as a single, unified computational system.

References

  1. Zhang et al. (2026) · Scaling Laws for Agent Harnesses via Effective Feedback Compute · arXiv:2605.29682
    Xuanliang Zhang, Dingzirui Wang, Keyan Xu, Qingfu Zhu, Wanxiang Che
  2. Kutschka et al. (2026) · Notation Matters: A Benchmark Study of Token-Optimized Formats in Agentic AI Systems · arXiv:2605.29676
    Lorenz Kutschka, Bernhard Geiger
  3. Chen et al. (2026) · Vortex: Efficient and Programmable Sparse Attention Serving for AI Agents · arXiv:2606.06453
    Zhuoming Chen, Xinrui Zhong, Qilong Feng, Ranajoy Sadhukhan, Yang Zhou, Michael Qizhe Shieh, et al.
  4. Pan et al. (2026) · Retrospective Harness Optimization: Improving LLM Agents via Self-Preference over Trajectory Rollouts · arXiv:2606.05922
    Wenbo Pan, Shujie Liu, Chin-Yew Lin, Jingying Zeng, Xianfeng Tang, Xiangyang Zhou, et al.