Google Research asks: How many raters do you actually need for AI benchmarks?

Most AI benchmarks are built on a shaky foundation: a handful of human raters, usually 1 to 5 per item, whose labels get collapsed into a single “ground truth.” The problem is that humans disagree all the time, especially on subjective tasks like toxicity detection or hate speech moderation. Ignoring that disagreement doesn’t make it go away — it just hides the noise.

Google Research just published a paper called “Forest vs Tree: The (N,K) Trade-off in Reproducible ML Evaluation” that actually tries to quantify this problem. Flip Korn and Chris Welty built a simulator that stress-tests different annotation budgets, varying both the number of items (N) and the number of raters per item (K). Their goal: find the configuration that gives the most reproducible results for a fixed budget.

The title metaphor is solid. The “forest” approach spreads your budget thin: collect a single rating on each of 1,000 items. The “tree” approach goes deep: have 20 people rate each of 50 items. Both cost the same 1,000 ratings. Historically, AI evaluation has defaulted to the forest. This paper argues that’s often a mistake.

Here’s the kicker: when human disagreement is high, depth matters more than breadth. A single rater’s label is one noisy sample from a whole range of opinions. Five raters still might not capture the true distribution of those opinions. The simulator showed that for many real-world datasets, you need substantially more raters per item than the field has been comfortable paying for.

But Google didn’t just say “spend more money.” They built an open-source simulator so you can run your own trade-off analysis. Plug in your budget, your expected disagreement rate, and the minimum effect size you care about detecting, and the simulator tells you the optimal N and K. That’s actually useful.
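To make the idea concrete, here is a toy version of that kind of trade-off analysis. This is my own minimal sketch, not the paper's simulator or its API. Everything in it is an assumption for illustration: the function names, the Beta distribution for per-item opinion splits, the two hypothetical models A and B (whose noise gap stands in for the "effect size"), and the mean-absolute-error metric. It fixes a rating budget, splits it into different (N, K) pairs, and counts how often the better model actually comes out ahead.

```python
import random

def simulate_win_rate(n_items, k_raters, trials=200, disagreement=1.0, seed=0):
    """Fraction of simulated evaluations in which model A beats model B.

    Assumptions (all hypothetical): each item's true positive rate p is
    drawn from Beta(disagreement, disagreement), where larger values push
    p toward 0.5 (more disagreement); model A predicts p with small noise
    and model B with larger noise; the metric is mean absolute error
    against the observed fraction of positive ratings.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        err_a = err_b = 0.0
        for _ in range(n_items):
            p = rng.betavariate(disagreement, disagreement)
            # Observed soft label: share of the k raters who said "positive".
            observed = sum(rng.random() < p for _ in range(k_raters)) / k_raters
            # Model predictions, clipped to the valid range [0, 1].
            pred_a = min(1.0, max(0.0, p + rng.gauss(0, 0.05)))
            pred_b = min(1.0, max(0.0, p + rng.gauss(0, 0.15)))
            err_a += abs(pred_a - observed)
            err_b += abs(pred_b - observed)
        wins += err_a < err_b  # True counts as 1
    return wins / trials

def sweep(budget, ks=(1, 2, 5, 10, 20), **kwargs):
    """Win rate for each (N, K) split of the budget, with N = budget // K."""
    return {(budget // k, k): simulate_win_rate(budget // k, k, **kwargs) for k in ks}

if __name__ == "__main__":
    for (n, k), rate in sweep(1000).items():
        print(f"N={n:4d} K={k:2d}  A beats B in {rate:.0%} of runs")
```

A win rate near 100% means the comparison reproduces; a rate near 50% means the ranking is a coin toss at that budget. Changing `disagreement` or the two noise levels shifts which split looks best, which is the whole point of running the analysis rather than guessing.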

I’ve seen too many papers where authors proudly report “we used 3 annotators and took the majority vote” as if that’s rigorous. This paper shows that approach can be statistically indistinguishable from random when the task is subjective. The reproducibility crisis in AI isn’t just about code sharing — it’s about pretending human judgment is a single number.
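A quick back-of-the-envelope check shows why small panels are fragile. The sketch below is my own illustration, not something taken from the paper, and it assumes raters are independent and that each one labels an item positive with the same probability `p`. It computes how often two separate panels of `k` raters would reach the same majority label on the same item.

```python
from math import comb

def majority_prob(p: float, k: int) -> float:
    """Probability that a majority of k independent raters says "positive",
    when each rater does so with probability p. Requires odd k (no ties)."""
    if k % 2 == 0:
        raise ValueError("use an odd k so there are no ties")
    need = k // 2 + 1  # smallest number of positive votes that wins
    return sum(comb(k, j) * p**j * (1 - p) ** (k - j) for j in range(need, k + 1))

def panel_agreement(p: float, k: int) -> float:
    """Probability that two independent k-rater panels agree on the majority label."""
    m = majority_prob(p, k)
    return m * m + (1 - m) * (1 - m)

if __name__ == "__main__":
    for k in (1, 3, 5, 21):
        print(f"k={k:2d}  panels agree with probability {panel_agreement(0.6, k):.3f}")
```

For an item where opinion splits 60/40, three raters produce a positive majority with probability 0.648, so two independent three-rater panels agree only about 54% of the time. That is barely better than flipping a coin, and it is the kind of instability a single "ground truth" label hides.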

One thing I appreciate is that the paper doesn’t pretend there’s a universal answer. The optimal trade-off depends on your task, your budget, and how much disagreement you expect. But the framework gives you a way to figure it out instead of guessing.

If you’re building a benchmark or running human evaluation, this is worth reading. The simulator is available on GitHub. Skip the forest when you need the tree.
