Granite 4.1: How IBM Actually Built These Small But Mighty LLMs

IBM’s Granite team just dropped a detailed technical walkthrough of how they built the Granite 4.1 LLMs, and it’s refreshingly honest about what matters: data quality over quantity, staged training, and not pretending small models can’t punch up.

The headline stat: the 8B instruct model matches or beats the previous Granite 4.0-H-Small, which was a 32B MoE with 9B active parameters. That’s a dense 8B outperforming a much larger MoE. How? By being smart about data, not just throwing compute at the problem.

Architecture: Nothing Fancy, Just Solid Choices

Granite 4.1 uses a dense decoder-only transformer. No MoE tricks, no exotic attention. Just Grouped Query Attention, RoPE, SwiGLU activations, RMSNorm, and shared input/output embeddings. Three sizes: 3B, 8B, and 30B. All share the same training pipeline and data strategy—they just scale up layer count, hidden size, and MLP dimensions.

| Component | 3B | 8B | 30B |
|---|---|---|---|
| Embedding size | 2560 | 4096 | 4096 |
| Layers | 40 | 40 | 64 |
| Attention heads | 40 | 32 | 32 |
| KV heads | 8 | 8 | 8 |
| MLP hidden | 8192 | 12800 | 32768 |

All three sizes use 8 KV heads, so the memory savings from GQA apply at every scale: the 3B keeps 5x fewer KV heads than query heads, the 8B and 30B keep 4x fewer. Smart.
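To put a number on that, here is a rough sketch of the per-token KV cache size implied by the table. Two things are assumptions on my part, not figures from IBM: the head dimension is taken as embedding size divided by attention heads, and the cache is stored in 16-bit precision (2 bytes per value).

```python
# Table values from the article. head_dim = hidden / heads is an assumption.
CONFIGS = {
    "3B": {"hidden": 2560, "layers": 40, "heads": 40, "kv_heads": 8},
    "8B": {"hidden": 4096, "layers": 40, "heads": 32, "kv_heads": 8},
    "30B": {"hidden": 4096, "layers": 64, "heads": 32, "kv_heads": 8},
}


def kv_bytes_per_token(cfg, kv_heads, bytes_per_value=2):
    """Bytes of KV cache one token occupies across all layers."""
    head_dim = cfg["hidden"] // cfg["heads"]  # assumed, see lead-in
    # Factor 2 covers keys and values; both are kept per layer and per KV head.
    return 2 * cfg["layers"] * kv_heads * head_dim * bytes_per_value


def gqa_savings(cfg):
    """How many times smaller the cache is than with one KV head per query head."""
    return cfg["heads"] / cfg["kv_heads"]


for name, cfg in CONFIGS.items():
    gqa = kv_bytes_per_token(cfg, cfg["kv_heads"])
    mha = kv_bytes_per_token(cfg, cfg["heads"])
    print(f"{name}: {gqa / 1024:.0f} KiB/token with GQA, "
          f"{mha / 1024:.0f} KiB/token without ({gqa_savings(cfg):.0f}x smaller)")
```

Under those assumptions the 8B model needs 160 KiB of cache per token, so a single 128K-token sequence takes about 20 GiB instead of the 80 GiB it would need with 32 KV heads.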

Pre-Training: Five Phases, Each With a Purpose

This is where the real engineering lives. Roughly 15 trillion tokens, but not all at once: five distinct phases, each with its own data mix and learning rate schedule. The first four phases listed below account for 14.5T of that.

Phase 1: General pre-training (10T tokens)

59% CommonCrawl (general web)
20% Code
7% Math
10.5% Technical (papers, docs)
2% Multilingual
1.5% Domain-specific

Standard starting point. Broad language understanding with a power LR schedule.
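As a quick illustration of what the Phase 1 mix means in absolute terms, the sketch below turns the percentages into token budgets and shows one simple way to sample sources in proportion to the mix. The function names and the sampling approach are mine, for illustration only; the article does not describe how IBM's data loader works.

```python
import random

# Phase 1 percentages as quoted in the article.
PHASE1_MIX = {
    "CommonCrawl": 59.0,
    "Code": 20.0,
    "Math": 7.0,
    "Technical": 10.5,
    "Multilingual": 2.0,
    "Domain-specific": 1.5,
}


def token_budget(mix, total_tokens_billions):
    """Split a phase's token budget (in billions) according to the mix."""
    if abs(sum(mix.values()) - 100) > 1e-9:
        raise ValueError("mix percentages must sum to 100")
    return {src: pct * total_tokens_billions / 100 for src, pct in mix.items()}


def sample_sources(mix, n, seed=0):
    """Draw n source names with probability proportional to the mix (illustrative)."""
    rng = random.Random(seed)
    return rng.choices(list(mix), weights=list(mix.values()), k=n)


budget = token_budget(PHASE1_MIX, 10_000)  # 10T tokens = 10,000 billion
for source, tokens in budget.items():
    print(f"{source:16s} {tokens:7.0f}B tokens")
```

Even at 7%, math gets 700B tokens in this phase, which is worth keeping in mind when reading the later percentage jumps.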

Phase 2: Math/Code pivot (2T tokens)
Math jumps 5x to 35%. Code goes to 30%. CommonCrawl drops to 12% and is now a high-quality subset. They add 9% synthetic data. This is where reasoning starts to get serious.

Phase 3: High-quality annealing (2T tokens)
Now we’re in mid-training territory. Exponential decay LR. The mix becomes more balanced: 16.67% each for CommonCrawl-HQ, Math, and Code. They start blending in 12.5% long chain-of-thought data and 12% instruction tuning data. This is unusual—most models don’t see instruction data until fine-tuning. Granite 4.1 bakes it into pre-training.

Phase 4: Refinement (0.5T tokens)
Linear decay to zero. Data quality is at its peak: 40% CommonCrawl-HQ, 20% Code, 20% Math, plus 14% instruction/reasoning data.
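The write-up names three schedule shapes (power for Phase 1, exponential decay for Phase 3, linear decay to zero for Phase 4) without giving formulas or hyperparameters. The sketch below shows generic textbook versions of each shape so the difference is easy to see. Every constant here (peak rate, warmup length, exponent, end ratio) is hypothetical and not taken from Granite.

```python
def power_decay(step, peak_lr, warmup_steps, exponent=0.5):
    """Linear warmup, then lr proportional to step ** (-exponent).

    warmup_steps must be at least 1.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr * (step / warmup_steps) ** (-exponent)


def exponential_decay(step, start_lr, total_steps, end_ratio=0.1):
    """Fall smoothly from start_lr to start_lr * end_ratio over total_steps."""
    return start_lr * end_ratio ** (step / total_steps)


def linear_to_zero(step, start_lr, total_steps):
    """Straight line from start_lr at step 0 down to 0 at total_steps."""
    return start_lr * max(0.0, 1.0 - step / total_steps)


# Hypothetical settings, only to show the shape of each curve.
for step in (0, 250, 500, 750, 1000):
    print(step,
          f"power={power_decay(step, 3e-4, 100):.2e}",
          f"exp={exponential_decay(step, 3e-4, 1000):.2e}",
          f"linear={linear_to_zero(step, 3e-4, 1000):.2e}")
```

The practical difference: a power schedule never commits to an end point, which suits a long open-ended first phase, while the exponential and linear shapes are annealing schedules that assume the phase length is known up front.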

Phase 5: Long context extension (LCE)
They stretch from 4K to 512K tokens in three stages: 32K, then 128K, then 512K. The final stage uses 80% books + 20% code repo data (only for 8B and 30B). After each stage, they do a model merge to preserve short-context performance. The RULER benchmark results are solid: the 8B base model scores 83.6 at 32K, 79.1 at 64K, 73.0 at 128K. Not perfect, but respectable for a dense model.
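The write-up does not say which merge method is used after each stage. The simplest variant, and the one sketched here purely as an assumption, is linear interpolation between the checkpoint from before the extension stage and the one from after it. Checkpoints are modeled as plain dicts of float lists to keep the example dependency-free; `alpha` is a hypothetical blend weight.

```python
def merge_checkpoints(base, extended, alpha=0.5):
    """Blend two checkpoints parameter by parameter.

    alpha=0 returns the short-context weights, alpha=1 the long-context ones.
    Linear interpolation is an assumed method, not confirmed by the source.
    """
    if base.keys() != extended.keys():
        raise ValueError("checkpoints must have the same parameter names")
    merged = {}
    for name in base:
        if len(base[name]) != len(extended[name]):
            raise ValueError(f"shape mismatch for {name}")
        merged[name] = [
            (1 - alpha) * b + alpha * e
            for b, e in zip(base[name], extended[name])
        ]
    return merged


# Toy checkpoints with made-up values.
short_ctx = {"layer0.weight": [0.0, 2.0], "layer0.bias": [1.0]}
long_ctx = {"layer0.weight": [1.0, 4.0], "layer0.bias": [3.0]}
print(merge_checkpoints(short_ctx, long_ctx, alpha=0.5))
```

The intuition is that the merged weights stay close to the model that was good at short inputs while picking up most of what the long-context stage learned.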

Supervised Fine-Tuning: Quality Over Quantity

SFT data was curated using an LLM-as-Judge framework. They ended up with ~4.1M high-quality samples. No numbers on how much they filtered out, but given the emphasis on “progressive refinement,” I’d guess the raw pool was much larger.
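For readers who have not seen the pattern, an LLM-as-Judge filter boils down to scoring each sample and keeping those above a threshold. The sketch below shows only that skeleton. The `judge` callable, the 1 to 5 scale, and the cutoff are all hypothetical; the write-up does not describe IBM's rubric or thresholds, and `toy_judge` is a stand-in so the example runs without a model.

```python
def filter_samples(samples, judge, min_score=4):
    """Keep the samples whose judge score reaches min_score."""
    kept = []
    for sample in samples:
        score = judge(sample["prompt"], sample["response"])
        if score >= min_score:
            kept.append(sample)
    return kept


def toy_judge(prompt, response):
    """Stand-in scorer. A real pipeline would prompt an LLM with a rubric here."""
    return 5 if response.strip() else 1


pool = [
    {"prompt": "Add 2 and 3.", "response": "2 + 3 = 5"},
    {"prompt": "Name a prime number.", "response": "   "},
]
print(filter_samples(pool, toy_judge))
```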

Reinforcement Learning: GRPO + DAPO

The RL stage uses on-policy GRPO with DAPO loss (Yu et al., 2025). Multi-stage pipeline targeting math, coding, instruction following, and general chat. This is where the 8B model’s ability to beat the 32B MoE likely comes from—RL can squeeze a lot of performance out of a smaller model if the reward signal is clean and the training is stable.
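Here is a minimal sketch of the two ingredients named above, written from the public descriptions of GRPO and DAPO rather than from IBM's training code. GRPO scores each sampled response relative to the other responses for the same prompt; DAPO adds an asymmetric clip range, averages the loss over tokens instead of over sequences, and skips prompt groups where every response got the same reward. The clip values and the toy inputs are illustrative, not Granite's settings.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: reward standardized within one prompt's group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def has_signal(rewards):
    """DAPO-style dynamic sampling: a group with identical rewards teaches nothing."""
    return max(rewards) != min(rewards)


def dapo_loss(logp_new, logp_old, advantages, eps_low=0.2, eps_high=0.28):
    """Clipped surrogate loss averaged over all tokens in the group.

    logp_new / logp_old: one list of per-token log-probs per response.
    advantages: one scalar per response, shared by all of its tokens.
    """
    total, n_tokens = 0.0, 0
    for new_seq, old_seq, adv in zip(logp_new, logp_old, advantages):
        for lp_new, lp_old in zip(new_seq, old_seq):
            ratio = math.exp(lp_new - lp_old)
            clipped = min(max(ratio, 1 - eps_low), 1 + eps_high)
            total += min(ratio * adv, clipped * adv)
            n_tokens += 1
    return -total / n_tokens


rewards = [1.0, 0.0, 1.0, 0.0]  # e.g. pass/fail from a verifier
if has_signal(rewards):
    adv = group_advantages(rewards)
    logp = [[-0.5, -0.7], [-1.2], [-0.3, -0.4, -0.9], [-2.0]]
    print(adv, dapo_loss(logp, logp, adv))
```

When training is fully on-policy the ratio is 1 for every token, so the clipping only starts to matter once the same batch is reused for more than one gradient step.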

Licensing: Apache 2.0

All three models are released under Apache 2.0, which allows commercial use and modification subject only to the license’s attribution and notice terms. That’s increasingly rare for capable small models.

My Take

The Granite 4.1 write-up is a textbook example of how to build small LLMs right. The staged data curation, the early injection of instruction data into pre-training, and the careful long-context extension with model merging are all techniques that other teams should copy.

The 8B matching or beating the 32B MoE is the kind of result that makes you question whether MoE complexity is worth it for most use cases. Dense models are simpler to deploy, easier to quantize, and have no expert routing to go unstable. If you can get comparable or better performance from a dense 8B, why bother with the MoE headache?

Granite 4.1 won’t dominate the LLM leaderboards, but it doesn’t need to. It’s a practical, well-engineered family of models that delivers where it counts: real-world performance in a deployable package.