Key facts
- Google released new Gemma 4 model checkpoints with quantization-aware training (QAT).
- QAT reduces the memory footprint and size of the models.
- According to Google, QAT checkpoints retain quality better than checkpoints produced with post-training quantization (PTQ) and decode faster.
- The compressed models are optimized for phones and laptops using a custom mobile-quantization schema.
- The Gemma 4 models are available in five sizes: E2B, E4B, 12B, 26B A4B, and 31B.
Google has released new Gemma 4 model checkpoints trained with quantization-aware training (QAT). The technique reduces the models' size and memory footprint, making them more practical to run on-device on phones and laptops. Post-training quantization (PTQ) compresses a model after training has finished, which can degrade quality; QAT instead simulates quantization during training, so the weights adapt to the reduced precision. Google states that the resulting checkpoints retain quality better than PTQ-quantized ones and decode faster. The optimization uses a custom mobile-quantization schema that includes pre-calculated settings, 2-bit compression in specific sections of the model, and compression of vocabulary lists and short-term memory. The QAT checkpoints are available in five sizes: Gemma 4 E2B, Gemma 4 E4B, Gemma 4 12B, Gemma 4 26B A4B, and Gemma 4 31B.
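
To make the difference between QAT and PTQ concrete, here is a minimal sketch of the general QAT idea. It is not Google's training recipe, which the announcement does not detail. It assumes symmetric 4-bit quantization with a fixed scale and a straight-through estimator for the gradient; the function names `fake_quantize` and `qat_fit`, the toy data, and all hyperparameters are hypothetical and chosen only for illustration.

```python
def fake_quantize(w, scale, bits=4):
    """Round w onto a symmetric signed integer grid, then map back to float."""
    qmax = 2 ** (bits - 1) - 1                    # 7 for 4-bit
    q = max(-qmax, min(qmax, round(w / scale)))   # round, then clamp to [-qmax, qmax]
    return q * scale


def qat_fit(xs, ys, scale, bits=4, lr=0.01, steps=300):
    """Fit y ~ w * x while the forward pass only ever sees the quantized weight."""
    w = 0.0  # latent full-precision weight, kept only during training
    for _ in range(steps):
        wq = fake_quantize(w, scale, bits)        # forward pass uses the quantized value
        # Gradient of the mean squared error with respect to wq.
        grad = sum(2 * (wq * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        # Straight-through estimator (assumption): treat d(wq)/d(w) as 1
        # and apply the gradient to the latent weight.
        w -= lr * grad
    return w, fake_quantize(w, scale, bits)


if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0]
    ys = [0.8 * x for x in xs]                    # toy data; the true weight 0.8 is off the grid
    latent, deployed = qat_fit(xs, ys, scale=0.25)
    print(f"latent={latent:.3f} deployed={deployed}")
```

The deployed weight always lands on the 4-bit grid, because the loss was computed against quantized values throughout training. In PTQ the same rounding would be applied once, after training, with no opportunity for the weights to adjust to it.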