Key facts
- Inception Labs' Mercury 2 AI model generates approximately 1,000 tokens per second.
- Mercury 2 scored 90% on the AIME 2026 mathematics benchmark.
- Google's DiffusionGemma scored 69.1% on the same AIME 2026 benchmark.
- Mercury 2 achieved a 77% score on the GPQA benchmark, compared to DiffusionGemma's 73.2%.
- Mercury 2 is a paid, closed-weight API model, while DiffusionGemma is free and open-weight.
Inception Labs has launched Mercury 2, an AI model it claims is the world's fastest reasoning language model, generating approximately 1,000 tokens per second. That speed puts it in a similar performance bracket to Google's recently announced DiffusionGemma. Both models reach these speeds by generating tokens in parallel rather than one at a time, as traditional sequential models do.
However, Mercury 2 has posted higher scores on key benchmarks. On the AIME 2026 mathematics test, Mercury 2 scored 90%, well ahead of DiffusionGemma's 69.1%. On GPQA, a PhD-level science benchmark, Mercury 2 scored 77% compared to DiffusionGemma's 73.2%. Google's own documentation suggests that its standard Gemma 4 model performs better than DiffusionGemma on quality metrics.
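To make the size of these gaps concrete, the short Python sketch below computes the percentage-point differences from the scores quoted above. The dictionary layout and the function name `point_gaps` are illustrative choices, not part of either vendor's tooling.

```python
# Benchmark scores (in percent) as quoted in this article.
SCORES = {
    "AIME 2026": {"Mercury 2": 90.0, "DiffusionGemma": 69.1},
    "GPQA": {"Mercury 2": 77.0, "DiffusionGemma": 73.2},
}


def point_gaps(scores):
    """Return Mercury 2's lead over DiffusionGemma in percentage points."""
    return {
        bench: round(s["Mercury 2"] - s["DiffusionGemma"], 1)
        for bench, s in scores.items()
    }


if __name__ == "__main__":
    for bench, gap in point_gaps(SCORES).items():
        print(f"{bench}: Mercury 2 leads by {gap} points")
```

By this arithmetic the lead is 20.9 points on AIME 2026 and 3.8 points on GPQA, so the gap is far wider on mathematics than on science questions.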
Independent evaluations, such as a case study with AI coding-agent company Augment Code, show Mercury 2 substantially reducing latency and cost when it replaces other models, while maintaining output quality. Inception Labs, founded by Stanford professor Stefano Ermon, is backed by notable investors including Nvidia's venture arm, Andrew Ng, and Andrej Karpathy.
The parallel diffusion approach allows AI systems to feel more responsive, enabling rapid iterations and efficient operation of multiple specialized AI agents within a larger system. While Mercury 2 is a closed-weight, API-based model and its ecosystem is still developing, its performance on commodity GPUs suggests significant potential for cost and energy savings at scale, particularly for speed-sensitive applications like real-time coding and voice interfaces.
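As a rough back-of-the-envelope sketch of why throughput matters for multi-agent systems, the Python snippet below estimates generation time at the roughly 1,000 tokens per second quoted above. The function names and the per-step token counts are hypothetical, and the estimate deliberately ignores network latency, prompt processing, and queueing.

```python
# Approximate throughput quoted in the article; real-world rates will vary.
MERCURY_2_TOKENS_PER_SECOND = 1000.0


def generation_seconds(output_tokens, tokens_per_second=MERCURY_2_TOKENS_PER_SECOND):
    """Estimate the time to generate `output_tokens` at a fixed throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


def pipeline_seconds(step_tokens, tokens_per_second=MERCURY_2_TOKENS_PER_SECOND):
    """Estimate total generation time when agent steps run one after another."""
    return generation_seconds(sum(step_tokens), tokens_per_second)


if __name__ == "__main__":
    # Hypothetical output lengths for three chained agent steps.
    steps = [400, 250, 350]
    print(f"Single 500-token reply: {generation_seconds(500):.2f} s")
    print(f"Three-step agent chain: {pipeline_seconds(steps):.2f} s")
```

Under these assumptions a three-step chain producing 1,000 tokens in total finishes in about one second of generation time, which is the kind of budget that real-time coding and voice interfaces depend on.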
