There’s a problem with synthetic data that doesn’t get talked about enough: most generation methods are still pretty dumb. They rely on manual prompts, evolutionary algorithms, or a pile of seed data from the target distribution. That works fine for simple cases, but it breaks down when you need production-scale datasets with real control over what goes in.
Google Research just published a paper in Transactions on Machine Learning Research that tries to fix this. The framework is called Simula, and the core idea is refreshing: treat synthetic data generation as mechanism design at the dataset level, not the sample level. Instead of optimizing one data point at a time, Simula reasons about the whole dataset from first principles.
Why this matters now
Generalist AI models have gotten good because there’s tons of internet data. But real-world deployment needs specialization — think medical records, financial transactions, or cybersecurity logs. These domains are data-scarce, privacy-sensitive, or both. You can’t just scrape the web and call it done.
Relying on real-world data comes with three hard limits:
- Cost and accessibility: Hand-labeling specialized datasets is expensive, slow, and error-prone.
- Operational drag: A collected real-world dataset is a static snapshot, and it is hard to update when requirements change. A synthetic-first approach lets you treat data like code: versioned, reproducible, inspectable.
- Preparedness: You can’t wait for safety failures to happen in the wild. Synthetic data lets you generate edge cases proactively, stress-testing systems against scenarios that don’t exist yet.
Current synthetic data methods don’t solve these well. They limit scalability (need seeds or human effort), explainability (black-box evolution), and control (entangled parameters). Most critically, they optimize one sample at a time, not the dataset as a whole.
Simula’s approach: Reasoning-first, seedless, agentic
Simula flips the script. It uses reasoning models to construct entire datasets from scratch, without any seed data. The generation capability improves naturally as the underlying reasoning models get better. That’s a nice property — it means Simula doesn’t need its own scaling tricks; it rides the wave of better LLMs and reasoners.
The framework decomposes generation into four controllable axes. The first one is the most interesting:
Global Diversification: Instead of random sampling, Simula uses reasoning models to map the conceptual space of a target domain into deep, hierarchical taxonomies. This acts as a “sampling scaffold.” By defining sampling strategies over these taxonomies, you control global diversity — ensuring the dataset covers the long tail of a domain rather than clustering around common modes.
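To make the “sampling scaffold” idea concrete, here is a minimal sketch of sampling over taxonomy leaves. The toy taxonomy, the function names, and the two strategies are my own illustration, not code or an API from the paper. The point is only that drawing over leaves, rather than over whatever a model produces by default, gives rare topics the same share as common ones.

```python
import random
from collections import Counter

# Hypothetical toy taxonomy: nested dicts, where an empty dict marks a leaf.
TAXONOMY = {
    "malware": {"ransomware": {}, "wiper": {}, "firmware implant": {}},
    "phishing": {"credential harvest": {}},
    "supply chain": {"dependency confusion": {}, "build server compromise": {}},
}

def leaf_paths(tree, prefix=()):
    """Yield every root-to-leaf path as a tuple of node names."""
    for name, children in tree.items():
        path = prefix + (name,)
        if children:
            yield from leaf_paths(children, path)
        else:
            yield path

def sample_uniform_over_leaves(tree, n, rng):
    """Random strategy: every leaf is equally likely on each draw."""
    leaves = list(leaf_paths(tree))
    return [rng.choice(leaves) for _ in range(n)]

def sample_stratified(tree, per_leaf):
    """Deterministic strategy: exactly per_leaf draws for every leaf."""
    return [path for path in leaf_paths(tree) for _ in range(per_leaf)]

rng = random.Random(0)  # fixed seed so the draw is reproducible
draws = sample_uniform_over_leaves(TAXONOMY, 600, rng)
counts = Counter(draws)  # per-leaf counts, useful for inspecting coverage
```

Each sampled path would then be handed to a generator as the topic for one sample. Swapping the strategy changes the dataset’s global shape without touching the generator at all.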
To build these taxonomies without human seed data, Simula runs a recursive “propose-and-refine” loop. At each depth level, the system generates multiple candidate sub-categories, then evaluates, merges, and filters them with a critic model. The result is a dense, hierarchical tree — like a Cyber Threat Intelligence taxonomy — that serves as the foundation for dataset diversity.
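Here is a rough sketch of what that recursive loop could look like. Everything in it is an assumption for illustration: `propose` and `critic` are stand-ins for calls to a reasoning model and a critic model (stubbed with canned data so the code runs), and the merge rule is a naive case-insensitive dedupe rather than whatever the paper actually uses.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    children: list["Node"] = field(default_factory=list)

# Hypothetical canned proposals; each inner list is one "candidate batch".
CANNED = {
    "cyber threats": [["Malware", "Phishing"], ["malware", "Insider threat", "misc"]],
    "malware": [["Ransomware", "Wiper"], ["ransomware"]],
}
VAGUE = {"misc", "other", "general"}

def propose(parent, n_candidates):
    """Stand-in for a reasoning model proposing several candidate lists."""
    return CANNED.get(parent, [])[:n_candidates]

def critic(candidates):
    """Stand-in for a critic model: merge duplicates, filter vague entries."""
    seen, kept = set(), []
    for name in candidates:
        key = name.strip().lower()
        if key in VAGUE or key in seen:
            continue
        seen.add(key)
        kept.append(key)
    return kept

def build(name, depth, max_depth, n_candidates=2):
    """Propose children at this level, refine them, then recurse one level down."""
    node = Node(name)
    if depth >= max_depth:
        return node
    proposals = [c for batch in propose(name, n_candidates) for c in batch]
    for child in critic(proposals):
        node.children.append(build(child, depth + 1, max_depth, n_candidates))
    return node

root = build("cyber threats", depth=0, max_depth=2)
```

In a real pipeline both stubs would be model calls, and `max_depth` and `n_candidates` become the knobs that trade taxonomy density against cost.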
Once you have the taxonomy, the other axes come into play:
- Local Complexity: Controls the difficulty of individual samples. Simple for basic tasks, complex for edge cases.
- Semantic Fidelity: Ensures generated data matches the real-world semantics of the domain, not just surface patterns.
- Quality Assurance: Automated checks at the sample and dataset level to catch artifacts or inconsistencies.
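One way to picture how these axes stay separate is a per-sample spec plus two levels of checks. This is a sketch under my own assumptions: the `SampleSpec` fields, the 1 to 3 complexity scale, and the keyword-based check are hypothetical, and the keyword match is only a crude placeholder for real fidelity or quality checks.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class SampleSpec:
    topic: tuple     # path into the taxonomy, e.g. ("malware", "ransomware")
    complexity: int  # hypothetical scale: 1 = basic, 3 = edge case

def make_specs(topics, complexity_weights, n, seed=0):
    """Draw topic and complexity independently so each can be tuned on its own."""
    rng = random.Random(seed)
    levels = list(complexity_weights)
    weights = [complexity_weights[level] for level in levels]
    return [
        SampleSpec(rng.choice(topics), rng.choices(levels, weights)[0])
        for _ in range(n)
    ]

def sample_check(text, spec):
    """Sample-level check: non-empty text that mentions the leaf topic."""
    return bool(text.strip()) and spec.topic[-1] in text.lower()

def dataset_check(specs, topics):
    """Dataset-level check: every topic is covered at least once."""
    return {s.topic for s in specs} == set(topics)

topics = [("malware", "ransomware"), ("phishing", "credential harvest")]
specs = make_specs(topics, {1: 0.5, 2: 0.3, 3: 0.2}, n=50)
```

Because topic and complexity are drawn independently, you can shift the difficulty mix without changing coverage, which is the kind of disentangled control the framework is after.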
What I like about this
Most synthetic data papers are about generating more data, not better data. Simula is explicitly about control. The idea of decomposing generation into independent axes (coverage, complexity, fidelity, quality) is practical. In production, you don’t just want “more medical records.” You want records that cover rare diseases, vary in complexity, and look realistic.
The seedless aspect is also underrated. Many real-world domains don’t have clean seed data. If you’re building a dataset for a niche industrial process or a new regulatory framework, you’re starting from zero. Simula’s reasoning-first approach works there.
One thing I’m skeptical about: the reliance on the underlying reasoning model’s quality. If the reasoning model is weak, the taxonomy generation will be weak. The paper acknowledges this, but it’s still a dependency. In practice, this means Simula’s performance is gated by the best available reasoners, and that frontier moves fast.
The bigger picture
Simula fits into a broader shift I’ve been noticing: people are getting tired of brute-force data collection. The era of “just scrape more” is ending, especially for specialized domains. Mechanism design — thinking about incentives, constraints, and optimization at the dataset level — is a more mature framing.
Google’s paper is worth reading if you’re building synthetic data pipelines for production. It’s not a magic bullet, but it’s a serious attempt to move beyond the manual, sample-level thinking that dominates current tools.
And honestly, anything that reduces the number of times I have to hand-label edge cases is a win in my book.