Google's Gemma 4 models use QAT to reduce on-device memory footprint

Created at 5 Jun · 20:50 UTC · 1 source · Market-relevant

IN SHORT

Google has released new Gemma 4 model checkpoints trained with quantization-aware training (QAT) to reduce their size and memory footprint. QAT builds quantization into the training process, which Google says yields better quality and faster decode speeds than post-training quantization (PTQ) and makes the models more efficient to deploy on phones and laptops. The optimized models come in five sizes.

Key Numbers

5 Gemma 4 model sizes released

2-bit compression in certain model parts

Who's Involved

Google

Developer of the Gemma 4 language models

Key facts

Google released new Gemma 4 model checkpoints with quantization-aware training (QAT).
QAT reduces the memory footprint and size of the models.
QAT results in better performance and faster decode speeds compared to post-training quantization (PTQ).
The compressed models are optimized for phones and laptops using a custom mobile-quantization schema.
The Gemma 4 models are available in five sizes: E2B, E4B, 12B, 26B A4B, and 31B.

Google has released new Gemma 4 model checkpoints that incorporate quantization-aware training (QAT). This training technique is designed to reduce the memory footprint and size of the models, making them more efficient for on-device deployment on phones and laptops. Unlike post-training quantization (PTQ), which can degrade performance, QAT integrates quantization into the training process itself, leading to better quality retention and accelerated decode speeds. Google states that this approach results in checkpoints with superior performance compared to those refined using PTQ. The optimization involves a custom mobile-quantization schema, including pre-calculated settings, 2-bit compression in specific model sections, and compression of vocabulary lists and short-term memory. The newly optimized Gemma 4 models are available in five sizes: Gemma 4 E2B, Gemma 4 E4B, Gemma 4 12B, Gemma 4 26B A4B, and Gemma 4 31B.
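To give a rough sense of why lower-precision weights matter on phones and laptops, the following back-of-the-envelope sketch estimates weight storage at several bit widths. It assumes that the "12B" name corresponds to roughly 12 billion parameters and that every weight uses the same bit width. Neither assumption is stated in the source, and real checkpoints also carry quantization scales and runtime memory that this ignores. The function name `weight_bytes` is made up for illustration.

```python
def weight_bytes(num_params: int, bits_per_weight: int) -> int:
    """Bytes needed to store num_params weights at the given bit width."""
    return num_params * bits_per_weight // 8


# Assumption: "12B" means about 12 billion parameters (not stated in the source).
ASSUMED_PARAMS = 12_000_000_000

for bits in (16, 8, 4, 2):
    gigabytes = weight_bytes(ASSUMED_PARAMS, bits) / 1e9
    print(f"{bits:>2}-bit weights: {gigabytes:.1f} GB")
```

Under these assumptions, moving from 16-bit to 4-bit weights cuts weight storage from about 24 GB to about 6 GB, and 2-bit storage would halve that again.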

↳ Why This Matters

These advancements in model optimization enable more powerful AI capabilities to run directly on consumer devices, improving user experience and privacy by reducing reliance on cloud processing.

FREQUENTLY ASKED

What is quantization-aware training (QAT)?

Quantization-aware training (QAT) is a method in which quantization, the process of reducing the precision of model weights and activations, is built directly into model training. This helps minimize the performance degradation often associated with quantization.
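The precision reduction described above can be sketched as a quantize-then-dequantize round trip, which QAT setups commonly simulate during training (often called fake quantization). This is a minimal illustration that assumes a symmetric grid with one scale per list of weights. It is not Google's mobile-quantization schema, and the function name `fake_quantize` and the sample weights are hypothetical.

```python
def fake_quantize(values, bits):
    """Round floats to a signed integer grid, then map them back to floats.

    Assumption: symmetric quantization with a single scale for the whole list.
    Returns the approximated values and the grid spacing (scale).
    """
    qmax = 2 ** (bits - 1) - 1       # largest integer level, e.g. 7 for 4-bit
    qmin = -(2 ** (bits - 1))        # smallest integer level, e.g. -8 for 4-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    approx = []
    for v in values:
        q = round(v / scale)             # nearest integer level
        q = max(qmin, min(qmax, q))      # clip to the representable range
        approx.append(q * scale)         # back to a float on the grid
    return approx, scale


weights = [0.9, -0.4, 0.1, 0.55]  # made-up example weights
for bits in (4, 2):
    approx, scale = fake_quantize(weights, bits)
    worst = max(abs(a - w) for a, w in zip(approx, weights))
    print(f"{bits}-bit: step={scale:.4f}, worst rounding error={worst:.4f}")
```

For these sample weights the 2-bit grid is much coarser than the 4-bit grid, so its worst rounding error is larger. That gap is the quality loss QAT tries to train around.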

How does QAT differ from post-training quantization (PTQ)?

QAT integrates quantization during training, preserving model quality and accelerating decode speeds. Post-training quantization (PTQ) applies quantization after the model has been fully trained, which can sometimes lead to a reduction in performance.
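The contrast can be made concrete with a one-weight toy model. In this sketch, PTQ rounds a finished weight once, while a QAT step runs the forward pass with the rounded weight and updates the underlying float weight as if rounding were the identity (the straight-through estimator, a common QAT technique). The grid spacing, data point, learning rate, and function names are all assumptions for illustration, not details of how Gemma 4 was trained.

```python
STEP = 0.25  # hypothetical quantization grid spacing


def quantize(w: float) -> float:
    """Snap a weight to the nearest multiple of STEP."""
    return round(w / STEP) * STEP


def ptq(trained_w: float) -> float:
    """PTQ: training is already finished, so the weight is rounded once as-is."""
    return quantize(trained_w)


def qat_step(latent_w: float, x: float, y: float, lr: float) -> float:
    """One QAT update for the model pred = quantize(w) * x with squared error.

    The forward pass uses the quantized weight. The backward pass treats
    rounding as the identity (straight-through estimator), so the gradient
    with respect to the quantized weight is applied to the float weight.
    """
    pred = quantize(latent_w) * x
    grad = 2.0 * (pred - y) * x
    return latent_w - lr * grad


w = 0.30  # float ("latent") weight kept during QAT
print("quantized weight before:", quantize(w))
w = qat_step(w, x=2.0, y=1.0, lr=0.05)
print("quantized weight after one QAT step:", quantize(w))
print("PTQ of a float-trained weight 0.37:", ptq(0.37))
```

In this toy run the float weight moves from 0.30 to about 0.40, which flips its quantized value from 0.25 to 0.5 and fits the single example exactly. The point is only that the training signal is computed with the rounded weight in the loop, which PTQ never does.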

What are the benefits of the Gemma 4 models with QAT?

The Gemma 4 models with QAT offer a reduced memory footprint and size, improved performance, and faster decode speeds, making them suitable for efficient on-device deployment on mobile phones and laptops.

PiQ Daily