VAKRA: Why Enterprise AI Agents Keep Tripping Over Simple Workflows

IBM Research just dropped VAKRA, and honestly, the results are humbling for anyone who thinks we’re close to reliable enterprise AI agents.

VAKRA stands for something I can’t be bothered to spell out, but the point is this: it’s a benchmark that tests whether AI agents can actually do things in enterprise environments. Not answer trivia questions, not generate marketing copy, but chain together API calls, pull data from documents, and complete multi-step workflows that look like real business tasks.

Here’s the headline: models perform poorly. Not “could use some improvement” poorly. Actually poorly.

What VAKRA Actually Tests

The benchmark covers 62 domains with over 8,000 locally hosted APIs backed by real databases. Each task requires 3 to 7 reasoning steps that combine structured API calls with unstructured document retrieval. The agents have to figure out which tools to use, in what order, and how to interpret the results.

VAKRA breaks down into four capability areas:

API Chaining (2,077 test instances across 54 domains): Agents use business intelligence tools from the SLOT-BIRD and SEL-BIRD collections. Think Tableau-style data manipulation – filtering, sorting, aggregation. Each instance starts with a get_data() call that returns a lightweight preview and configures the server to expose the right tools. The example in the paper asks “Which football team has build-up play speed of 31, build-up play dribbling of 53, and build-up play passing of 32?” The answer is FC Barcelona, but the agent has to chain five tool calls to get there.
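To make the chaining pattern concrete, here is a minimal sketch of what a five-call chain could look like. Treat all of it as an assumption for illustration: only get_data() is named in the benchmark description, while filter_eq, select_column, the column names, and the toy rows are made up and are not VAKRA's actual tools or data.

```python
from typing import Any

Row = dict[str, Any]

# Toy stand-in for a database table; the rows are invented for illustration.
_TEAMS: list[Row] = [
    {"team": "Team A", "speed": 31, "dribbling": 53, "passing": 32},
    {"team": "Team B", "speed": 31, "dribbling": 40, "passing": 32},
    {"team": "Team C", "speed": 60, "dribbling": 53, "passing": 32},
]


def get_data() -> list[Row]:
    """First call in the chain: load the rows that later tools operate on."""
    return [dict(row) for row in _TEAMS]


def filter_eq(rows: list[Row], column: str, value: Any) -> list[Row]:
    """Hypothetical tool: keep rows where `column` equals `value`."""
    return [row for row in rows if row[column] == value]


def select_column(rows: list[Row], column: str) -> list[Any]:
    """Hypothetical tool: project a single column out of the rows."""
    return [row[column] for row in rows]


def answer_question() -> list[Any]:
    """Chain five tool calls: one load, three filters, one projection."""
    rows = get_data()
    rows = filter_eq(rows, "speed", 31)
    rows = filter_eq(rows, "dribbling", 53)
    rows = filter_eq(rows, "passing", 32)
    return select_column(rows, "team")
```

Each step consumes the previous step's output, so one wrong filter or a skipped load changes the final answer.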

Tool Selection (1,597 instances across 17 domains): This one uses REST-style APIs from the expanded REST-BIRD collection. The twist is that each domain has between 6 and 328 tools (average 116). The OpenAI API spec only allows 128 tools in a single request, so agents that rely on that interface hit a hard ceiling. This forces some architectural decisions about how you expose tools to the agent.
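One common way around a per-request tool cap is to shortlist tools before the model ever sees them. The sketch below is my own assumption about how that could look, not something the VAKRA authors prescribe: it ranks tools by plain word overlap with the query and keeps the top 128. The function names and the scoring rule are hypothetical, and a real system would more likely use embeddings.

```python
import re

MAX_TOOLS = 128  # per-request tool cap mentioned above


def _tokens(text: str) -> set[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def shortlist_tools(
    query: str, tools: dict[str, str], limit: int = MAX_TOOLS
) -> list[str]:
    """Return up to `limit` tool names, best word overlap with the query first.

    `tools` maps tool name -> description. Ties are broken alphabetically
    so the result is deterministic.
    """
    query_tokens = _tokens(query)

    def sort_key(name: str) -> tuple[int, str]:
        overlap = len(query_tokens & _tokens(name + " " + tools[name]))
        return (-overlap, name)

    return sorted(tools, key=sort_key)[:limit]
```

With a 328-tool domain, this hands the model at most 128 candidates. The obvious risk is that the right tool gets cut by the shortlist before the agent has a chance to pick it.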

Document-Grounded API Chaining (1,946 instances): The hardest of the bunch in my opinion. Agents need to retrieve relevant information from a document collection first, then use that information to make the right API calls. This is where the “reading comprehension meets tool use” problem really shows up.

Document-Grounded Tool Selection (1,761 instances): Similar to the above, but with a larger tool set and more emphasis on finding the right tool based on document context.

The Failure Modes That Matter

The VAKRA team analyzed where models fail, and the patterns are revealing:

Tool hallucination is rampant. Models invent API calls that don’t exist. Not just wrong parameters, but entirely fictional endpoints. This is terrifying for production deployments, where a call to a non-existent endpoint can fail outright or derail every step that depends on it.
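A cheap guard against hallucinated tools is to check every requested call against the set of tools that actually exist before dispatching it. This is a minimal sketch under my own assumptions: the call format (a dict with "name" and "arguments"), the registry shape, and UnknownToolError are hypothetical, not part of VAKRA.

```python
from typing import Any, Callable


class UnknownToolError(Exception):
    """Raised when the model asks for a tool that is not registered."""


def dispatch(call: dict[str, Any], registry: dict[str, Callable[..., Any]]) -> Any:
    """Run a model-requested tool call only if the tool really exists."""
    name = call.get("name")
    if name not in registry:
        # Fail loudly instead of forwarding a made-up endpoint downstream.
        known = ", ".join(sorted(registry))
        raise UnknownToolError(f"unknown tool {name!r}; known tools: {known}")
    arguments = call.get("arguments") or {}
    return registry[name](**arguments)
```

Raising a typed error gives the agent loop something it can catch and feed back to the model as a correction prompt.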

Chaining order failures. Agents pick the right tools but call them in the wrong sequence. The get_data() call has to come first, but models frequently try to filter or sort before they’ve loaded the data. This is a basic workflow issue that any junior developer would catch.
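Ordering mistakes like this are easy to detect from a call trace. The helper below is a hypothetical sketch, assuming the trace is just a list of tool names in the order the agent called them. It reports every call that was issued before get_data() ran.

```python
def calls_before_load(trace: list[str], loader: str = "get_data") -> list[str]:
    """Return the tool calls issued before the loader was called.

    An empty list means the ordering is valid. If the loader never ran,
    every call in the trace counts as premature.
    """
    if loader in trace:
        return trace[: trace.index(loader)]
    return list(trace)
```

For example, calls_before_load(["filter", "get_data", "sort"]) flags "filter" as premature, while a trace that opens with get_data comes back clean.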

Context window mismanagement. When agents retrieve documents, they often dump everything into the context and then can’t find the relevant pieces. The retrieval step contaminates the reasoning step.
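One way to stop retrieval from flooding the context is to put a hard budget on what gets passed along. This sketch is an assumption on my part rather than anything from the benchmark: it takes passages that already carry a relevance score from some retriever and greedily keeps the best ones that fit a character budget. A production system would count tokens instead of characters.

```python
def select_passages(scored: list[tuple[float, str]], budget_chars: int) -> list[str]:
    """Greedily keep the highest-scoring passages that fit the budget.

    `scored` holds (relevance_score, passage_text) pairs from a retriever.
    Passages that would overflow the budget are skipped, not truncated.
    """
    kept: list[str] = []
    used = 0
    for _score, text in sorted(scored, key=lambda pair: pair[0], reverse=True):
        if used + len(text) <= budget_chars:
            kept.append(text)
            used += len(text)
    return kept
```

The point is that the reasoning step only ever sees a bounded, ranked slice of what retrieval returned.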

Over-reliance on tool descriptions. Models that performed well with clear, concise tool names fell apart when descriptions were verbose or ambiguous. This suggests current agents are pattern-matching on surface text such as tool names rather than understanding what the tools do.

Why This Matters Right Now

Everyone’s rushing to deploy AI agents in enterprise settings. Every SaaS company is bolting on an “AI assistant” that promises to automate workflows. VAKRA suggests these systems are brittle in ways that won’t show up in demo environments.

The benchmark is executable – it actually runs the tool calls and checks whether the final result is correct. This is harder than evaluating on multiple-choice questions or generation quality. Execution traces don’t lie. If the agent calls the wrong API, the database returns garbage, and the benchmark knows.
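The idea behind execution-based scoring fits in a few lines. To be clear, this is not VAKRA's scoring code; it is a hypothetical sketch that assumes each task yields a list of result values and that row order does not matter.

```python
from typing import Any


def execution_match(predicted: list[Any], expected: list[Any]) -> bool:
    """Compare two executed results, ignoring the order of the rows."""
    return sorted(map(repr, predicted)) == sorted(map(repr, expected))


def accuracy(results: list[tuple[list[Any], list[Any]]]) -> float:
    """Fraction of tasks whose executed result matches the expected one."""
    if not results:
        return 0.0
    matches = sum(execution_match(pred, exp) for pred, exp in results)
    return matches / len(results)
```

Nothing here grades how plausible the agent's reasoning sounded. Only the executed result counts.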

I’ve seen enough agent demos that work perfectly on stage and fall apart in production. VAKRA formalizes what many of us have suspected: current LLM-based agents lack robust reasoning for multi-step tool use. The gap between “can answer questions about enterprise data” and “can reliably execute enterprise workflows” is enormous.

The Dataset Is Public

VAKRA is open source. The dataset, the leaderboard, the code – all on GitHub. If you’re building agents for enterprise use, you should be testing against this. The 8,000+ APIs and 62 domains cover enough ground that if your agent passes VAKRA, it’s probably not completely incompetent.

But here’s the thing: VAKRA tests specific patterns. It doesn’t test everything. Real enterprise workflows involve authentication, rate limiting, error handling, partial failures, and human-in-the-loop decisions. VAKRA is a solid stress test for reasoning and tool use, but it’s not a certification of production readiness.

Still, it’s the best we’ve got right now. And the results should make anyone deploying agents in production take a hard look at their evaluation pipeline.
