There’s a quiet shift happening in AI evaluation that most people outside the trenches haven’t noticed yet. We’ve spent years obsessing over training costs: how many GPUs, how much electricity, how long to convergence. But evaluation can now be the bigger expense, especially once you move beyond static benchmarks into agentic systems.
The Holistic Agent Leaderboard (HAL) dropped a number that made me blink: roughly $40,000 to run 21,730 agent rollouts across 9 models and 9 benchmarks. That’s not training. That’s just checking how well the things work. A single GAIA run on a frontier model hits $2,829 before you even think about caching. Exgentic’s sweep across agent configurations found a 33× cost spread on identical tasks, with scaffold choice emerging as the primary cost driver. UK-AISI scaled agentic steps into the millions just to study inference-time compute behavior.
In scientific ML, The Well costs about 960 H100-hours to evaluate one new architecture and 3,840 H100-hours for a full four-baseline sweep. That’s real compute that could have gone into actual research.
The static benchmark era had a fix
This didn’t start with agents. Back in 2022, Stanford’s CRFM released HELM and the per-model accounting was already eye-watering: from $85 for OpenAI’s code-cushman-001 up to $10,926 for AI21’s J1-Jumbo for API models, plus 540 to 4,200 GPU-hours for open models, with BLOOM and OPT at the top end. IBM noted that putting Granite-13B through HELM could consume 1,000 GPU-hours. Across HELM’s 30 models and 42 scenarios, the aggregate hit roughly $100,000.
But Perlitz et al. found something interesting: cutting evaluation compute by 100× to 200× preserved nearly the same rankings. Flash-HELM turned that into a coarse-to-fine procedure: run cheap evaluations first, then spend high-resolution compute only on the top candidates. tinyBenchmarks compressed MMLU from 14,000 items to 100 anchor items at about 2% error. The Open LLM Leaderboard suite shrank from 29,000 examples to 180. Anchor Points showed that as few as 1 to 30 examples could rank-order 87 language-model/prompt pairs on GLUE.
Static benchmarks had a weakness you could exploit: model differences concentrate in a small subset of items. Rankings survive aggressive subsampling.
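To make the subsampling idea concrete, here is a minimal sketch in Python. It uses uniform random subsampling on a simulated pass/fail matrix, which is not the anchor-selection method used by tinyBenchmarks or Anchor Points; the model skills, item counts, and function names are all hypothetical. It only shows how you might check whether a small subset reproduces the full-benchmark ranking.

```python
import random
from itertools import combinations

def kendall_tau(a, b):
    """Kendall rank correlation between two equal-length score lists.
    Tied pairs count as neither concordant nor discordant."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        s = (a[i] - a[j]) * (b[i] - b[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / pairs

def simulate_results(n_models, n_items, rng):
    """Hypothetical data: results[m][i] is True if model m solves item i."""
    skills = [rng.uniform(0.3, 0.9) for _ in range(n_models)]
    return [[rng.random() < s for _ in range(n_items)] for s in skills]

def accuracy(row, item_ids):
    """Fraction of the given items that one model solved."""
    return sum(row[i] for i in item_ids) / len(item_ids)

def subsample_agreement(results, k, rng):
    """Rank agreement between full-benchmark scores and scores on k random items."""
    n_items = len(results[0])
    full = [accuracy(r, range(n_items)) for r in results]
    subset = rng.sample(range(n_items), k)
    small = [accuracy(r, subset) for r in results]
    return kendall_tau(full, small)

rng = random.Random(0)
results = simulate_results(n_models=10, n_items=5000, rng=rng)
for k in (50, 500, 5000):
    print(k, round(subsample_agreement(results, k, rng), 3))
```

A tau close to 1.0 at small k is the property the static-benchmark papers exploit. The published methods pick items deliberately rather than at random, which is how they get away with far fewer of them.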
Agent evals break that trick entirely
Agent benchmarks are a completely different animal. They’re noisy, scaffold-sensitive, and only partly compressible. The cost spread is brutal: in HAL, the cost of a single benchmark run varies by four orders of magnitude across tasks, and by three orders of magnitude within some individual benchmarks. Model pricing compounds it. Claude Opus 4.1 is priced at $15 per million input tokens and $75 per million output tokens; Gemini 2.0 Flash at $0.10 and $0.40. That’s a 150× gap, two orders of magnitude, on input alone.
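To see how the per-token prices above turn into per-rollout dollars, here is a small Python sketch. The prices are the ones quoted in the text; the token counts for a single rollout are made up for illustration, and rollout_cost is a hypothetical helper, not part of any provider SDK.

```python
# (input, output) prices in dollars per million tokens, as quoted in the text.
PRICES_PER_MILLION = {
    "Claude Opus 4.1": (15.00, 75.00),
    "Gemini 2.0 Flash": (0.10, 0.40),
}

def rollout_cost(model, input_tokens, output_tokens):
    """Dollar cost of one rollout at list prices, ignoring caching and discounts."""
    in_price, out_price = PRICES_PER_MILLION[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Assumption: an agent rollout that re-reads a long context many times,
# so input tokens dominate. These counts are illustrative only.
INPUT_TOKENS = 2_000_000
OUTPUT_TOKENS = 50_000

opus = rollout_cost("Claude Opus 4.1", INPUT_TOKENS, OUTPUT_TOKENS)
flash = rollout_cost("Gemini 2.0 Flash", INPUT_TOKENS, OUTPUT_TOKENS)
ratio = opus / flash
print(f"Opus: ${opus:.2f}  Flash: ${flash:.2f}  ratio: {ratio:.0f}x")
```

Because agent scaffolds tend to resend context at every step, the input price usually matters more than the output price. That is one reason scaffold choice and model choice interact so strongly on cost.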
Worse, higher spend doesn’t reliably buy better results. On Online Mind2Web, Browser-Use with Claude Sonnet 4 cost $1,577 for 40% accuracy, while SeeAct with GPT-5 Medium hit 42% for $171. That is the 9× cost difference the HAL paper notes for a two-percentage-point accuracy gap, and the gap favors the cheaper agent. On GAIA, a HAL Generalist with o3 Medium cost $2,828 for 28.5% accuracy, while a different agent hit 57.6% for $1,686. CLEAR found across 6 SOTA agents on 300 enterprise tasks that accuracy-optimal configurations cost 4.4 to 10.8× more than Pareto-efficient alternatives with comparable real-world performance.
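The idea of a Pareto-efficient configuration can be sketched in a few lines of Python. The cost and accuracy figures are the ones quoted above; the function name and data layout are my own, and the second GAIA agent is left unnamed because the text doesn’t name it.

```python
def pareto_frontier(runs):
    """Keep runs that no other run matches or beats on both cost and accuracy
    while being strictly better on at least one of them."""
    frontier = []
    for r in runs:
        dominated = any(
            o["cost"] <= r["cost"] and o["acc"] >= r["acc"]
            and (o["cost"] < r["cost"] or o["acc"] > r["acc"])
            for o in runs
        )
        if not dominated:
            frontier.append(r)
    return frontier

# Figures as quoted in the text (cost in dollars, accuracy as a fraction).
MIND2WEB = [
    {"agent": "Browser-Use + Claude Sonnet 4", "cost": 1577, "acc": 0.40},
    {"agent": "SeeAct + GPT-5 Medium", "cost": 171, "acc": 0.42},
]
GAIA = [
    {"agent": "HAL Generalist + o3 Medium", "cost": 2828, "acc": 0.285},
    {"agent": "unnamed agent (per the text)", "cost": 1686, "acc": 0.576},
]

for name, runs in (("Online Mind2Web", MIND2WEB), ("GAIA", GAIA)):
    for r in pareto_frontier(runs):
        print(f"{name}: {r['agent']} (${r['cost']}, {r['acc']:.1%})")
```

In both pairs the more expensive run is strictly dominated, so it drops off the frontier entirely. With more configurations you would normally see a curve of genuine trade-offs instead of a single winner.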
Training-in-the-loop is a different beast
Then there’s training-in-the-loop evaluation, which is expensive by construction. Every iteration requires running the agent, collecting data, and updating the model, so the costs compound. Adding reliability through repeated runs multiplies them again. The Pythia checkpoint analysis from Perlitz et al. showed that the cost of evaluating checkpoints may even surpass the cost of pretraining. For small models, evaluation becomes the dominant compute line item across the entire development cycle.
What this means practically
This is starting to reshape who can meaningfully evaluate AI systems. If a single agent eval run costs thousands of dollars, and you need multiple runs for statistical significance, you’re looking at serious budgets just to know if your system works. Smaller labs and independent researchers are getting priced out of comprehensive evaluation. The field risks converging on a small set of well-funded evaluations that may not capture the full picture.
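A back-of-the-envelope way to see how statistical significance drives the budget: the Python sketch below uses the normal approximation to a binomial confidence interval, which assumes independent pass/fail tasks and is only a rough guide for agent evals. The per-rollout cost is simply HAL’s reported $40,000 divided by its 21,730 rollouts; that average lumps cheap and expensive benchmarks together, so treat it as illustrative. The function names are hypothetical.

```python
import math

def tasks_needed(p, half_width, z=1.96):
    """Independent pass/fail tasks needed so that a 95% confidence interval
    on accuracy p is about +/- half_width (normal approximation)."""
    return math.ceil(z * z * p * (1 - p) / half_width ** 2)

def sweep_cost(n_rollouts, cost_per_rollout):
    """Total dollars for n rollouts at a flat per-rollout cost."""
    return n_rollouts * cost_per_rollout

# Rough average from the HAL figures in the text; an assumption, not a quote.
avg_cost = 40_000 / 21_730

# How many tasks to pin a ~40% accuracy down to +/- 2 points?
n = tasks_needed(p=0.40, half_width=0.02)
print(n, "tasks, roughly $", round(sweep_cost(n, avg_cost)))
```

Halving the interval width quadruples the number of rollouts, which is why a two-point accuracy gap between two agents is so expensive to confirm.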
The compression techniques that worked for static benchmarks don’t transfer cleanly to agentic settings. You can’t subsample agent trajectories the way you can subsample multiple-choice questions: the interactions are sequential and context-dependent, the scaffold matters as much as the model, and larger token budgets multiply the cost of every run.
I don’t have a clean solution here. But acknowledging the problem is the first step. We need better evaluation infrastructure that doesn’t require $40K sweeps to get reliable signals. Otherwise, the cost of knowing whether your AI works will become a barrier that only the well-funded can cross.