Gemma 4: Ushering in a New Era of Open Multimodal AI

Google DeepMind has officially released Gemma 4, a powerful family of open models. Unlike its predecessors, Gemma 4 is natively multimodal, capable of processing text and images across the board, with native audio support integrated into the lightweight models.

Gemma 4 is designed to democratize frontier AI by offering a variety of sizes—from on-device lightweight versions to server-grade powerhouses—and two distinct architectures (Dense and MoE), ensuring seamless deployment across everything from mobile phones and laptops to high-end GPUs.

1. Model Lineup: Flexible Deployment Options

Gemma 4 provides four distinct model sizes to fit various hardware constraints:

Dense Models

E2B (2.3B effective params): Ultra-lightweight, supports text, image, and audio. Specifically optimized for on-device use.
E4B (4.5B effective params): Lightweight, supports text, image, and audio, balancing performance and speed.
31B Dense: The flagship dense model, supporting text and image, providing top-tier reasoning capabilities.

Mixture-of-Experts (MoE) Model

26B A4B (3.8B active params): Utilizing the MoE architecture, it has a total of 26B parameters but only activates 3.8B during inference. This allows it to deliver performance close to the 31B model while running almost as fast as a 4B model.

Technical Highlight: PLE (Per-Layer Embeddings) In the E2B and E4B models, Google introduced Per-Layer Embeddings. This allows each decoder layer to have its own small embedding table, maximizing parameter efficiency and ensuring smoother execution on mobile devices without adding unnecessary layers.

2. Core Capability Breakthroughs

🧠 Native Reasoning (Thinking Mode)

Every model in the Gemma 4 family is designed as a high-capability reasoner. With the configurable Thinking Mode, the model can engage in step-by-step internal reasoning (Chain-of-Thought) before producing the final answer, significantly improving accuracy for complex logical tasks.

🖼️ Advanced Multimodal Understanding

Vision: Supports variable aspect ratios and resolutions. It excels at object detection, PDF/document parsing, UI understanding, and high-precision OCR.
Video: Capable of video understanding by processing sequences of frames.
Audio (E2B/E4B only): Native support for Automatic Speech Recognition (ASR) and speech-to-translated-text translation.

💻 Enhanced Coding & Agentic Capabilities

Gemma 4 shows marked improvements in code generation, completion, and correction. With native Function Calling support, it can power highly capable autonomous agents for complex workflows.

3. Architecture & Performance

Hybrid Attention Mechanism

Gemma 4 employs a hybrid design that interleaves Local Sliding Window Attention (SWA) with Full Global Attention (GA). This ensures the model maintains deep awareness for long-context tasks while benefiting from the processing speed and low memory footprint of a lightweight model.

Performance Benchmarks

Gemma 4 demonstrates frontier-level performance across key metrics:

MMLU Pro: The 31B model achieves 85.2%.
AIME 2026 (Math): The 31B model reaches 89.2% without tools.
LiveCodeBench v6 (Code): The 31B model scores 80.0%, a massive leap over previous versions.
Context Window: Small models support 128K, while medium/large models support up to 256K tokens.

4. Getting Started

Gemma 4 is fully integrated with the transformers library. Users can easily enable the reasoning process via the chat template:

# Enable thinking mode generation
text = processor.apply_chat_template(
    messages, 
    tokenize=False, 
    add_generation_prompt=True, 
    enable_thinking=True  # Key: Enable the internal thought process
)

Resources:

Gemma 4: Google DeepMind's Omnimodal Open Model Family