If you’ve been following Arabic LLM evaluation for a while, you’ve probably noticed something weird. The number of benchmarks keeps growing, but nobody seems to be asking the obvious question: are we actually measuring the right things?
That’s the problem a team from TII (Technology Innovation Institute) decided to tackle with QIMMA (قمّة, Arabic for “summit”). Instead of just throwing models at existing benchmarks and declaring winners, they built a quality validation pipeline that runs before any evaluation happens. What they found is honestly a bit embarrassing for the field.
The Mess We’re In
Arabic NLP evaluation has been a mess for years. Here’s what’s been bugging anyone who actually works with Arabic language models:
Translated benchmarks are everywhere. Someone takes an English benchmark, runs it through Google Translate, and calls it an Arabic evaluation. The result is questions that feel like they were written by someone who learned Arabic from a phrasebook. Cultural context gets lost, idioms become nonsense, and the whole thing stops measuring actual Arabic capability.
Even native Arabic benchmarks skip quality checks. You’d think benchmarks created by Arabic speakers would be better, but they often aren’t. Annotation inconsistencies, wrong gold answers, encoding errors, and cultural bias in ground-truth labels are common. People just don’t check.
Nobody shares their outputs. Evaluation scripts and per-sample results are rarely public. So when someone claims their model scores 85% on some Arabic benchmark, you can’t verify it. You can’t even see which questions it got wrong.
Coverage is fragmented. Existing leaderboards focus on isolated tasks. One tests reading comprehension, another tests translation, but nobody’s looking at the full picture.
QIMMA sits in a different category. It's the only Arabic leaderboard that combines all of the following: it's fully open source, uses 99% native Arabic content, applies systematic quality validation, includes code evaluation, and publishes per-sample outputs. Other leaderboards cover some of these, but not all five at once.
What’s Actually Inside
The team consolidated 109 subsets from 14 source benchmarks into a unified evaluation suite with over 52,000 samples across 7 domains:
- Cultural: AraDiCE-Culture, ArabCulture, PalmX – testing how well models understand Arab culture
- STEM: ArabicMMLU, GAT, 3LM STEM – math, science, and technical knowledge
- Legal: ArabLegalQA, MizanQA – legal reasoning and knowledge
- Medical: MedArabiQ, MedAraBench – healthcare and medical knowledge
- Safety: AraTrust – how models handle sensitive content
- Poetry & Literature: FannOrFlop – Arabic literary tradition
- Coding: 3LM HumanEval+, 3LM MBPP+ – programming with Arabic problem statements
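To make the structure of the suite concrete, here is a minimal sketch that encodes the list above as a plain mapping from domain to benchmark names. The names `SUITE`, `benchmark_count`, and `domain_of` are made up for illustration and are not part of QIMMA's codebase; the benchmark names are copied from the list as written.

```python
# Domain -> benchmarks, transcribed from the list above.
# SUITE and the helpers below are hypothetical names for illustration only.
SUITE = {
    "Cultural": ["AraDiCE-Culture", "ArabCulture", "PalmX"],
    "STEM": ["ArabicMMLU", "GAT", "3LM STEM"],
    "Legal": ["ArabLegalQA", "MizanQA"],
    "Medical": ["MedArabiQ", "MedAraBench"],
    "Safety": ["AraTrust"],
    "Poetry & Literature": ["FannOrFlop"],
    "Coding": ["3LM HumanEval+", "3LM MBPP+"],
}


def benchmark_count(suite):
    """Total number of benchmark entries across all domains."""
    return sum(len(names) for names in suite.values())


def domain_of(benchmark, suite=SUITE):
    """Return the domain a benchmark is listed under, or None if absent."""
    for domain, names in suite.items():
        if benchmark in names:
            return domain
    return None


if __name__ == "__main__":
    print(len(SUITE), "domains,", benchmark_count(SUITE), "benchmark entries")
```

Counting the entries this way gives 7 domains and 14 benchmark entries, which lines up with the figures quoted above.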
The code evaluation piece is worth highlighting. This is the first Arabic leaderboard to include coding benchmarks, using Arabic-adapted versions of HumanEval+ and MBPP+. That’s a big deal for anyone who wants to use Arabic LLMs for actual development work.
The Quality Pipeline That Changes Everything
Here’s where QIMMA earns its credibility. Before evaluating a single model, every sample in every benchmark goes through a multi-stage validation process.
Stage 1: Two-model automated assessment. Each sample is independently scored by Qwen3-235B-A22B-Instruct and DeepSeek-V3-671B. The team chose two models with strong Arabic capabilities but different training data, so their combined judgment is more robust than either alone. Each model scores samples against a 10-point rubric of binary criteria. A sample is flagged if either model scores it below 7 out of 10. If both models flag it, it's dropped immediately. If only one does, it goes to human review.
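The routing rule in Stage 1 is simple enough to sketch in a few lines. This is an illustration under stated assumptions, not the team's implementation: it assumes each judge's score is the count of binary rubric criteria a sample satisfies (0 to 10), and that a "flag" means a score below 7. `JudgeScores`, `triage`, and the field names are hypothetical.

```python
from dataclasses import dataclass

PASS_THRESHOLD = 7  # a score below 7 out of 10 counts as a flag


@dataclass
class JudgeScores:
    """Scores for one sample from the two judge models (hypothetical shape)."""
    sample_id: str
    judge_a: int  # assumed: number of binary rubric criteria met, 0..10
    judge_b: int


def triage(scores, threshold=PASS_THRESHOLD):
    """Route a sample based on how many judges flagged it."""
    flags = sum(s < threshold for s in (scores.judge_a, scores.judge_b))
    if flags == 2:
        return "drop"          # both judges flagged it
    if flags == 1:
        return "human_review"  # judges disagree, so a person decides
    return "keep"              # neither judge flagged it


if __name__ == "__main__":
    for s in (JudgeScores("s1", 9, 8), JudgeScores("s2", 7, 6), JudgeScores("s3", 4, 5)):
        print(s.sample_id, triage(s))
```

The point of the design is that disagreement between the two judges is treated as a signal in itself: only unanimous flags are acted on automatically, and everything else is escalated.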
Stage 2: Human annotation. Native Arabic speakers review flagged samples. They make final calls on cultural context, dialectal nuance, and subjective interpretation. For culturally sensitive content, multiple perspectives are considered because “correctness” genuinely varies across Arab regions.
What They Actually Found
This is the part that should make benchmark creators uncomfortable. The pipeline revealed systematic quality problems across established benchmarks. Not isolated errors, but patterns.
Translation artifacts were everywhere. Questions that were perfectly clear in English became ambiguous or nonsensical in Arabic. Cultural references that worked in Western contexts fell flat or were actively misleading in Arab contexts. Some benchmarks had encoding errors that corrupted entire subsets.
More troubling was the discovery of incorrect gold answers. Questions where the “correct” answer was demonstrably wrong, but nobody had caught it because nobody was checking. This isn’t just academic nitpicking. When you’re using these benchmarks to decide which model to deploy in production, bad data leads to bad decisions.
The team hasn’t published the full list of eliminated samples yet, but the paper is available on GitHub for anyone who wants to dig into the details.
What This Means for Practitioners
If you’re building applications with Arabic LLMs, QIMMA gives you something you didn’t have before: a leaderboard you can actually trust. The quality validation means the scores reflect genuine Arabic language capability, not the ability to guess through poorly translated questions.
The per-sample outputs are a game-changer for reproducibility. You can see exactly which questions each model got right or wrong, which means you can audit results and build on prior work. That’s basic scientific practice, but it’s been missing from Arabic NLP until now.
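As a rough idea of what an audit with per-sample outputs could look like, here is a small sketch that compares two models question by question. The data shape is an assumption: each model's results are taken to be a mapping from sample ID to whether the answer was correct. The actual format QIMMA publishes may differ, and `accuracy` and `disagreements` are hypothetical helper names.

```python
def accuracy(results):
    """Fraction of samples answered correctly (0.0 for an empty result set)."""
    return sum(results.values()) / len(results) if results else 0.0


def disagreements(model_a, model_b):
    """List the shared samples where exactly one of the two models was correct."""
    shared = model_a.keys() & model_b.keys()
    return {
        "only_a_correct": sorted(s for s in shared if model_a[s] and not model_b[s]),
        "only_b_correct": sorted(s for s in shared if model_b[s] and not model_a[s]),
    }


if __name__ == "__main__":
    # Made-up results for illustration; not real leaderboard data.
    a = {"q1": True, "q2": False, "q3": True}
    b = {"q1": True, "q2": True, "q3": False}
    print(accuracy(a), accuracy(b))
    print(disagreements(a, b))
```

Two models can land on the same aggregate score while failing on entirely different questions, and that difference is only visible when per-sample results are available.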
The code evaluation component is particularly useful for practical applications. If you’re building Arabic-language coding assistants or developer tools, you can now evaluate how well models handle programming tasks with Arabic problem statements. That’s been a blind spot in existing benchmarks.
The Bottom Line
QIMMA is a much-needed reality check for Arabic NLP evaluation. The team found that even widely used, well-regarded benchmarks have quality issues that can quietly corrupt evaluation results. The fact that they're open-sourcing everything and publishing per-sample outputs sets a new standard for transparency.
Is it perfect? No. The validation pipeline relies on automated assessment from two specific models, which introduces its own biases. The human annotation stage is resource-intensive and hard to scale. And the leaderboard only covers 14 source benchmarks, so there’s plenty of room for expansion.
But it’s a significant step forward. For anyone working with Arabic LLMs, QIMMA is worth paying attention to. The rankings might surprise you, and that’s exactly the point.