What is Gemma 4, and how is it different from previous Gemma models?

Gemma 4 is Google DeepMind's fourth-generation open model family — now fully multimodal, with four sizes that run from a smartphone to a server, all under Apache 2.0.

Google DeepMind released Gemma 4 on April 2, 2026. Unlike earlier Gemma generations, which were primarily text models (Gemma 3 added basic image input) aimed at research and fine-tuning, Gemma 4 is a fully multimodal family. The four models — E2B, E4B, 26B A4B, and 31B — handle text, images, audio, and video depending on size. They ship with built-in reasoning modes, context windows of up to 256,000 tokens, native function calling for agent workflows, and multilingual support covering more than 140 languages.

The "E" prefix on the two smallest models stands for "effective" parameters. Both E2B and E4B use Per-Layer Embeddings (PLE), a technique that gives each decoder layer its own small embedding table. This makes the total parameter count larger on disk than the working compute count during inference, which is why E2B has 2.3 billion effective parameters but 5.1 billion including its embeddings. The result is a model that punches above its inference-time weight — it's efficient on mobile hardware without the accuracy hit that normally comes from aggressive parameter trimming.
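
The effective-versus-total split is simple arithmetic. A minimal sketch, using the E2B figures from the text; the "streamed in" framing summarises how PLE is described above, not a published architecture detail:

```python
# Illustrative arithmetic for Per-Layer Embeddings (PLE).
# Figures are the E2B numbers quoted in the text.

effective_params = 2.3e9   # parameters doing compute per forward pass
total_params = 5.1e9       # parameters stored on disk, embeddings included

# With PLE, the per-layer embedding tables sit in storage and only
# small slices are needed per token, so they barely add inference compute.
ple_overhead = total_params - effective_params
print(f"PLE embedding overhead: {ple_overhead / 1e9:.1f}B params")
print(f"Share of stored weights idle per step: {ple_overhead / total_params:.0%}")
```

Over half the stored weights never enter the per-token compute path, which is the whole point of the technique.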

The 26B A4B is a Mixture-of-Experts model. MoE means the model contains 128 specialised sub-networks (experts), but only 8 of them activate for any given token during inference. The model has 25.2 billion total parameters, but only 3.8 billion are active at once. It runs at roughly the speed of a 4B dense model while delivering accuracy closer to the full 26B. The dense 31B flagship — 30.7 billion parameters, no sparsity tricks — targets workstations and consumer GPUs where memory isn't the constraint.
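
The routing idea behind MoE can be sketched with a toy top-k gate. The logits here are random stand-ins, not Gemma 4's learned router; only the 128-expert/8-active shape comes from the text:

```python
import math
import random

random.seed(0)

NUM_EXPERTS, TOP_K = 128, 8  # the 26B A4B figures quoted above

def route(gate_logits, k=TOP_K):
    """Pick the top-k experts for a token and renormalise their weights."""
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i],
                 reverse=True)[:k]
    exps = [math.exp(gate_logits[i]) for i in top]
    z = sum(exps)
    return [(i, e / z) for i, e in zip(top, exps)]

# One token's gate logits (random stand-ins for a learned router).
logits = [random.gauss(0, 1) for _ in range(NUM_EXPERTS)]
chosen = route(logits)

assert len(chosen) == TOP_K
assert abs(sum(w for _, w in chosen) - 1.0) < 1e-9
print("experts activated for this token:", [i for i, _ in chosen])
```

Every token runs through only 8 of the 128 expert sub-networks, which is why the active parameter count stays near 4B while the stored count is 25.2B.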

What can Gemma 4 actually do?

Gemma 4 handles six capability categories across its four models: reasoning, long-context understanding, image analysis, audio processing, code generation, and agentic function calling.

All four Gemma 4 models include a configurable reasoning mode. Enable it and the model generates internal chain-of-thought before producing its final answer — similar in concept to OpenAI's o-series thinking or Anthropic's extended thinking in Claude. The implementation uses a dedicated channel token to separate the reasoning trace from the output, and it can be turned off entirely for latency-sensitive deployments.
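
A sketch of separating the reasoning trace from the final answer. The `<think>`/`</think>` strings are hypothetical stand-ins, since the actual channel-token strings aren't documented here:

```python
import re

def split_thinking(raw: str):
    """Separate an internal reasoning trace from the final answer.
    The <think> delimiters are stand-ins for the real channel token."""
    m = re.search(r"<think>(.*?)</think>", raw, flags=re.DOTALL)
    trace = m.group(1).strip() if m else ""
    answer = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()
    return trace, answer

raw = "<think>17 * 3 = 51</think>The answer is 51."
trace, answer = split_thinking(raw)
print(answer)  # -> The answer is 51.
```

With thinking disabled, `raw` simply contains no delimited trace and the function passes the text through unchanged.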

On images, all four models handle variable aspect ratios and resolutions through a configurable visual token budget, letting you trade image detail for speed: a budget of 70 tokens processes quickly, while a budget of 1,120 retains fine-grained, OCR-level detail. The vision encoder in the smaller models totals about 150 million parameters; the 31B uses a more capable 550-million-parameter encoder. Audio is a different story: the two larger models, 26B A4B and 31B, do not support it. If you need speech recognition or speech translation, E2B and E4B are your options, and both cap audio input at 30 seconds.
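
Why a larger budget preserves OCR-level detail can be illustrated with patch arithmetic. The 16-pixel patch size is an assumption for illustration, not a published Gemma 4 detail:

```python
import math

PATCH = 16  # hypothetical patch edge in pixels, chosen for illustration

def max_square_side(token_budget: int, patch: int = PATCH) -> int:
    """Largest square image (in px) representable without downscaling,
    if each patch costs one visual token."""
    return math.isqrt(token_budget) * patch

for budget in (70, 1120):
    print(f"{budget} tokens -> up to {max_square_side(budget)} px per side")
```

Under this assumption, a 70-token budget effectively sees a heavily downscaled image, while 1,120 tokens keeps enough resolution for small text to survive.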

For code, the 31B model achieved a Codeforces ELO of 2,150 and an 80.0% score on LiveCodeBench v6, according to Google DeepMind's published benchmarks. That places it in competitive territory with top frontier models, including closed systems. Native function calling is built in, which means these models can operate as the reasoning core of an automated agent — reading tool outputs, deciding next steps, and driving multi-step workflows without a separate orchestration layer telling them what to do.
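
A minimal sketch of the agent loop such a model could drive. The model call is mocked, and the tool-request dict shape is an assumption for illustration, not Gemma 4's actual function-calling format:

```python
# Minimal tool-dispatch loop: the model requests a tool, reads the
# result, then answers. mock_model stands in for a real generate call.

TOOLS = {"add": lambda a, b: a + b}

def mock_model(history):
    """Stand-in for the model: first request a tool, then answer."""
    if not any(m["role"] == "tool" for m in history):
        return {"tool": "add", "args": {"a": 2, "b": 3}}
    return {"answer": f"The sum is {history[-1]['content']}."}

history = [{"role": "user", "content": "What is 2 + 3?"}]
while True:
    step = mock_model(history)
    if "answer" in step:
        print(step["answer"])  # -> The sum is 5.
        break
    result = TOOLS[step["tool"]](**step["args"])
    history.append({"role": "tool", "content": result})
```

The loop itself is the "no separate orchestration layer" point: the model's own outputs decide when to call a tool and when to stop.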

| Model | Effective Params | Context | Modalities | AIME 2026 | GPQA Diamond | MMLU Pro |
|---|---|---|---|---|---|---|
| Gemma 4 E2B | 2.3B (5.1B w/ embeddings) | 128K | Text, Image, Audio | 37.5% | 43.4% | 60.0% |
| Gemma 4 E4B | 4.5B (8B w/ embeddings) | 128K | Text, Image, Audio | 42.5% | 58.6% | 69.4% |
| Gemma 4 26B A4B (MoE) | 3.8B active / 25.2B total | 256K | Text, Image | 88.3% | 82.3% | 82.6% |
| Gemma 4 31B | 30.7B | 256K | Text, Image | 89.2% | 84.3% | 85.2% |

Source: Google DeepMind / HuggingFace model card, April 2, 2026. AIME 2026 scores are "no tools" configurations.

How does Gemma 4 benchmark against closed models?

The 31B model's 89.2% on AIME 2026 and 84.3% on GPQA Diamond land it in the same tier as the best closed frontier models, based on publicly available benchmark comparisons.

AIME 2026 is the American Invitational Mathematics Examination — a competition-level math problem set that requires multi-step symbolic reasoning, not pattern matching. A score of 89.2% places the 31B model at the frontier of publicly tested reasoning capability. GPQA Diamond is a 198-question benchmark verified by domain experts in biology, physics, and chemistry — the kind of subject-matter reasoning where memorisation breaks down fast. An 84.3% score signals the model can handle technical knowledge synthesis, not just retrieval.

The story gets more interesting when you look at the MoE model. The 26B A4B scores 88.3% on AIME 2026 and 82.3% on GPQA Diamond — one to two points behind the 31B — while using only 3.8B active parameters at inference. For anyone running this on an API server or a fast GPU, that efficiency gap translates directly to cost and latency. You're getting near-frontier reasoning at the compute cost of a small dense model.
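
The efficiency claim follows from per-token compute scaling roughly with active parameters (about 2 FLOPs per parameter per token for a decoder forward pass). A back-of-envelope check:

```python
# Per-token compute scales with ACTIVE parameters, not stored ones.
active_moe = 3.8e9       # 26B A4B active params per token
dense_flagship = 30.7e9  # 31B dense params, all active every token

ratio = active_moe / dense_flagship
print(f"MoE per-token compute vs 31B dense: {ratio:.0%}")
```

Roughly one-eighth of the flagship's per-token compute for a one-to-two-point benchmark gap is the cost/latency trade the paragraph describes.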

Even the smallest model, E2B, scores 60% on MMLU Pro, a harder successor to the original MMLU that tests professional-level knowledge across 14 broad disciplines. A 2.3B effective-parameter model reaching 60% would have been considered a notable research result two years ago. Today it's the entry-level SKU of a free open-source release.

Google DeepMind also published a BigBench Extra Hard score of 74.4% for the 31B, an MMMLU multilingual score of 88.4%, and an HLE score of 19.5% (26.5% with search access enabled). HLE (Humanity's Last Exam) is among the most difficult benchmarks currently in use; a score above 25% with search is rare for models of any size.

Why does Apache 2.0 licensing matter for developers and enterprises?

Apache 2.0 means no API lock-in, no usage restrictions, and no vendor permission required — developers and companies can deploy, modify, and redistribute Gemma 4 freely, including in commercial products.

The licensing context matters because not all "open" AI releases are equally open. Some use custom licenses that prohibit commercial use above a revenue threshold, restrict fine-tuning, or require the original developer's approval for certain applications. Apache 2.0 carries no such conditions. You can take the 31B weights, fine-tune them on your proprietary data, deploy them inside a product you charge customers for, and redistribute the result — without telling Google. That's a meaningfully different category than models that call themselves "open" but include exclusions for high-revenue commercial use.

For enterprises specifically, Apache 2.0 eliminates the data-privacy risk that comes with sending inputs to a third-party API. Running Gemma 4 locally means your prompts, documents, and outputs never leave your infrastructure. For healthcare, finance, and legal organizations operating under data residency requirements, that distinction can be the difference between a deployable tool and a non-starter.

For independent developers, the equation is simpler: at the performance level the 26B A4B achieves, running it locally or on a modest cloud GPU instance costs a fraction of a closed frontier API. The per-query math changes completely when you own the weights.

How do you run Gemma 4 today?

Gemma 4 works with Hugging Face Transformers, llama.cpp, MLX for Apple Silicon, and browser-based WebGPU.

The fastest path for most developers is a few pip installs and six lines of Python.

The Transformers path is the most complete. Install transformers, torch, and accelerate, load the model with AutoModelForCausalLM, and call processor.apply_chat_template with enable_thinking=True or enable_thinking=False depending on whether you want the internal reasoning trace. The model handles the thinking tokens automatically — you don't need to parse them out manually if you use the processor's parse_response function.
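
A sketch of that call pattern. The model id is hypothetical, and the heavyweight load/generate calls are left as comments so the snippet runs without downloading weights:

```python
# Hypothetical model id; the real repo name may differ.
#
#   from transformers import AutoModelForCausalLM, AutoProcessor
#   model = AutoModelForCausalLM.from_pretrained("google/gemma-4-e4b")
#   processor = AutoProcessor.from_pretrained("google/gemma-4-e4b")
#   prompt = processor.apply_chat_template(messages, **template_kwargs)

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Summarise Per-Layer Embeddings in one sentence."},
]

# enable_thinking toggles the internal reasoning trace on or off.
template_kwargs = {"enable_thinking": True, "add_generation_prompt": True}

assert {m["role"] for m in messages} == {"system", "user"}
print("chat-template kwargs:", template_kwargs)
```

Flipping `enable_thinking` to `False` is the latency-sensitive path described earlier; everything else in the call stays the same.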

For Apple Silicon (M1 through M4 Pro), the MLX path is more performant. The MLX community has already pushed quantized versions of both the E4B and 26B A4B models to HuggingFace with 4-bit and 8-bit configurations. The E4B runs on a 16GB MacBook Pro M4. The 26B A4B fits in 32GB with 4-bit quantization. The 31B dense model needs at least 64GB of unified memory at full precision, or a quantized variant on a 32GB Max chip.
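
Those memory figures follow from params × bits ÷ 8 arithmetic. This sketch ignores KV cache and runtime overhead, so real requirements run somewhat higher:

```python
# Back-of-envelope weight memory: params * bits / 8 bytes.
def weight_gb(params: float, bits: int) -> float:
    """Approximate weight storage in GB at a given bit width."""
    return params * bits / 8 / 1e9

for name, params, bits in [
    ("26B A4B @ 4-bit", 25.2e9, 4),
    ("26B A4B @ 8-bit", 25.2e9, 8),
    ("31B @ 16-bit",    30.7e9, 16),
]:
    print(f"{name}: ~{weight_gb(params, bits):.1f} GB of weights")
```

About 12.6 GB of weights at 4-bit is what lets the 26B A4B fit a 32GB machine with room for the KV cache, and roughly 61 GB at 16-bit is why the dense 31B wants 64GB of unified memory.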

For llama.cpp, quantized GGUF versions of all four models are already available through community uploads as of April 3, 2026. The 26B A4B in Q4_K_M configuration runs on a single consumer GPU (24GB VRAM) at acceptable speed. For browser-based deployment, Google has already demonstrated Gemma 4 running via WebGPU in Chrome, much as it shipped Gemma 3 in a browser tab, and a live demo is available on HuggingFace Spaces.

Nexairi Analysis: The Widening Gap Between Open and Closed

The timing here is not subtle. OpenAI just closed a $122 billion funding round — the largest in startup history — at a reported $300 billion valuation. That capital buys private compute, proprietary data, and infrastructure that independent developers can't replicate. The bet is that scale produces capability that no open competitor can match.

Gemma 4 is a direct data point against that thesis. Google DeepMind published a 31B model with AIME 2026 scores in the same tier as frontier closed systems — and made it free. The 26B MoE model runs as fast as a 4B model at inference, which means the compute cost of frontier-class reasoning is dropping in real time, for anyone.

This doesn't mean the closed frontier is irrelevant. Models at the absolute top of the capability curve — the ones used for complex, multi-hour agentic tasks — still belong to OpenAI and Anthropic for now. But the gap between "open best" and "closed best" is shrinking on every benchmark cycle. The realistic question isn't whether open models will catch up. It's whether closed providers can sustain the capital returns that justify $122 billion before they do.

For enterprise buyers, the calculus is already shifting. If Gemma 4 gives you 90% of the capability at 10% of the cost — and keeps your data on-premise — the burden of proof now falls on closed providers to justify the premium. That's a different market dynamic than existed 18 months ago.
