Frontier models score below 50% on ITBench-AA, a new benchmark for agentic enterprise IT tasks from IBM and Artificial Analysis.
The sub-50% performance on ITBench-AA exposes a critical gap between raw reasoning capabilities and the practical execution of multi-step, state-dependent IT workflows. For enterprise engineering teams, this signals that deploying autonomous agents for infrastructure management still requires heavy human-in-the-loop scaffolding and domain-specific fine-tuning. We are clearly still far from reliable, out-of-the-box IT automation.
IBM and Artificial Analysis have jointly released ITBench-AA, the first comprehensive benchmark designed specifically to evaluate AI models on agentic enterprise IT tasks. The results show that current frontier models, despite strong performance on general reasoning (MMLU) and coding (SWE-bench) evaluations, score below 50% on the new benchmark.
Technical Details
Unlike static Q&A benchmarks, ITBench-AA focuses on the multi-step, stateful workflows typical of enterprise IT environments. These tasks require models to act as autonomous agents: querying system state, parsing complex logs, interacting with APIs, modifying configurations, and verifying the outcomes of their actions. The sub-50% success rate suggests that frontier models have the syntax and domain knowledge that IT operations require but suffer from compounding errors during sequential execution, where a mistake in one step carries into every step that depends on it. They frequently lose context over long sequences, misread the intermediate state of a system, or hallucinate API parameters when interacting with enterprise software stacks.
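To make the compounding-error point concrete, here is a minimal arithmetic sketch in Python. It assumes every step succeeds independently with the same probability, which is a simplification for illustration and not something the benchmark reports. The function name and the 95% figure are hypothetical.

```python
def workflow_success_rate(per_step_success: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps.

    Illustrative model only: real agent failures are correlated and
    per-step reliability varies by task.
    """
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_success ** steps


if __name__ == "__main__":
    # Hypothetical agent that gets each individual step right 95% of the time.
    for steps in (1, 5, 10, 20):
        rate = workflow_success_rate(0.95, steps)
        print(f"{steps:>2} steps at 95% per step: {rate:.1%} end-to-end")
```

Under this simplified model, an agent that is 95% reliable per step completes a 20-step workflow only about 36% of the time, so a low end-to-end score is compatible with high per-step accuracy.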
Why It Matters
For engineering and platform teams, these results are a reality check on the current state of AI-driven IT automation. The industry narrative has heavily pushed the imminent arrival of autonomous AI sysadmins and DevOps agents. However, the ITBench-AA results indicate that out-of-the-box frontier models lack the reliability needed for unsupervised infrastructure management. Deploying these models in production IT environments currently demands rigorous human-in-the-loop (HITL) oversight, extensive scaffolding (such as strict state-machine guardrails), and likely domain-specific fine-tuning to achieve acceptable success rates.
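One way to picture the state-machine guardrails and human-in-the-loop oversight mentioned above is a small workflow wrapper that only permits a fixed set of transitions and requires reviewer sign-off before any change is applied. The sketch below is an assumption about what such scaffolding could look like, not a description of ITBench-AA or of any particular product. The names `Step`, `GuardedWorkflow`, and `approver` are hypothetical.

```python
from enum import Enum, auto
from typing import Callable


class Step(Enum):
    OBSERVE = auto()   # read system state and logs
    PROPOSE = auto()   # agent drafts a change
    APPROVE = auto()   # waiting on a human reviewer
    APPLY = auto()     # change is executed
    VERIFY = auto()    # outcome is checked
    DONE = auto()


# Allowed transitions; any other move requested by the agent is rejected.
ALLOWED = {
    Step.OBSERVE: {Step.PROPOSE},
    Step.PROPOSE: {Step.APPROVE},
    Step.APPROVE: {Step.APPLY, Step.OBSERVE},  # reviewer may send the agent back
    Step.APPLY: {Step.VERIFY},
    Step.VERIFY: {Step.DONE, Step.OBSERVE},    # failed check returns to observing
    Step.DONE: set(),
}


class GuardrailError(RuntimeError):
    """Raised when the agent attempts a disallowed or unapproved transition."""


class GuardedWorkflow:
    def __init__(self, approver: Callable[[str], bool]) -> None:
        self.state = Step.OBSERVE
        self._approver = approver  # hypothetical callback that asks a human

    def advance(self, target: Step, change_summary: str = "") -> None:
        if target not in ALLOWED[self.state]:
            raise GuardrailError(
                f"{self.state.name} -> {target.name} is not allowed"
            )
        # The human gate: applying a change requires explicit approval.
        if self.state is Step.APPROVE and target is Step.APPLY:
            if not self._approver(change_summary):
                raise GuardrailError("human reviewer rejected the change")
        self.state = target
```

With this structure, an agent that tries to jump straight from `OBSERVE` to `APPLY` gets a `GuardrailError` and no change is executed, so the sequencing mistake surfaces as an explicit failure.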
What to Watch Next
In the short term, expect a shift away from zero-shot agentic deployments in enterprise IT toward heavily constrained, tool-calling copilots. Watch for specialized, fine-tuned models trained explicitly on IT operation traces and infrastructure-as-code (IaC) repositories to close this performance gap. Additionally, as model builders optimize for agentic workflows, they will likely treat ITBench-AA as a primary target and cite its scores as evidence of enterprise readiness.
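As a rough illustration of what "heavily constrained, tool-calling" can mean in practice, the sketch below checks each tool call a model proposes against a small allowlist and rejects unknown tools or parameters before anything runs. The tool names, parameters, and the `validate_tool_call` helper are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSpec:
    required: frozenset[str]
    optional: frozenset[str] = frozenset()


# Hypothetical read-only allowlist; a real deployment would define its own.
TOOLS = {
    "get_service_status": ToolSpec(required=frozenset({"service"})),
    "tail_log": ToolSpec(
        required=frozenset({"path"}), optional=frozenset({"lines"})
    ),
}


def validate_tool_call(name: str, args: dict) -> list[str]:
    """Return a list of problems; an empty list means the call may proceed."""
    spec = TOOLS.get(name)
    if spec is None:
        return [f"unknown tool: {name}"]
    problems = []
    supplied = set(args)
    # Parameters the tool needs but the model did not provide.
    for missing in sorted(spec.required - supplied):
        problems.append(f"missing parameter: {missing}")
    # Parameters the model supplied that the tool does not define.
    for extra in sorted(supplied - spec.required - spec.optional):
        problems.append(f"unexpected parameter: {extra}")
    return problems
```

A call such as `validate_tool_call("get_service_status", {"svc": "nginx"})` returns both a missing-parameter and an unexpected-parameter problem. That lets a copilot return the error to the model or escalate to a human without executing the malformed call.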