Gemma 4 Beat Models 20x Its Size. Google Gave It Away.
The 31-billion-parameter model that rewrote the leaderboard
Google DeepMind dropped Gemma 4 on April 2, and within 48 hours the AI community’s tidy assumptions about model scale collapsed into a pile of outdated PowerPoint slides. The flagship 31-billion-parameter dense model landed at #3 on the Arena AI text leaderboard with an Elo score of 1452, beating systems with 600 billion or more parameters on reasoning and mathematical benchmarks. The 26-billion-parameter mixture-of-experts variant secured the #6 spot at 1441 — using only 3.8 billion active parameters at inference time. A model that activates fewer parameters than a mid-2023 chatbot is now outperforming frontier systems that require clusters of eight-GPU machines to run. That sentence alone should give every AI infrastructure budget holder a migraine.
The family spans four variants. At the top sit the 31B dense model and the 26B MoE, both multimodal, both capable of ingesting text, images, video, and audio within a 256,000-token context window, and both fluent in more than 140 languages. At the bottom — and this is where Google’s ambitions get genuinely interesting — sit the E2B and E4B “Effective” models, shrunk to 2.3 and 4 billion parameters respectively and designed to run entirely on smartphones with native audio support. With aggressive quantization, the E2B fits inside 1.5 gigabytes of RAM on some devices. A frontier-class reasoning model that runs on the same silicon powering your weather app. That is not an incremental improvement. That is a category redefinition.
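That 1.5-gigabyte figure survives a back-of-envelope check: at 4-bit quantization, 2.3 billion weights occupy about 1.15 GB before runtime overhead. A minimal sketch of the arithmetic, with the overhead factor as an illustrative assumption rather than anything Google has published:

```python
# Back-of-envelope memory footprint for a quantized on-device model.
# Parameter counts come from the article; bits-per-weight and the
# overhead factor are illustrative assumptions, not Google's numbers.

def weight_footprint_gb(params_billions: float, bits_per_weight: float,
                        overhead: float = 1.15) -> float:
    """Approximate RAM for model weights; overhead loosely covers
    embedding tables and runtime buffers that quantization can't shrink."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total * overhead / 1e9

for label, params in [("E2B", 2.3), ("E4B", 4.0)]:
    for bits in (4, 8):
        print(f"{label} @ {bits}-bit: ~{weight_footprint_gb(params, bits):.2f} GB")
```

At 4 bits the E2B lands around 1.3 GB with overhead, comfortably inside the claimed 1.5 GB envelope; at 8 bits it would not fit.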
But the metric that will echo loudest through enterprise procurement offices and startup pitch decks is not a benchmark score. It is two words: Apache 2.0. For two years, enterprises evaluating open-weight AI models navigated a licensing minefield. Meta’s Llama carries a community license that restricts usage above 700 million monthly active users — a threshold that sounds generous until your platform serves a global customer base or your legal team starts asking uncomfortable questions about derivative works. Google’s own prior Gemma releases shipped under a custom license that introduced just enough friction to slow adoption in regulated industries. Gemma 4 scraps all of it. Apache 2.0 means no restrictions on commercial use, no user caps, no clauses about competing products. Google is matching the licensing posture of Alibaba’s Qwen and Mistral while offering models that, by multiple measures, outperform both. The open-model race in April 2026 is being won on licensing terms as much as on benchmark points, and Google just removed every legal objection a Fortune 500 general counsel could raise.
Since the first Gemma generation launched, developers have downloaded Gemma models more than 400 million times, and the community has produced over 100,000 fine-tuned variants on Hugging Face, Kaggle, and Ollama. That installed base is now inheriting a model that scores 89.2 percent on AIME 2026 — the competitive math benchmark that separates toys from tools — compared to 20.8 percent for its predecessor, Gemma 3 27B. The question is no longer whether small open models can compete with proprietary giants. The question is whether the giants can justify their price tags.
Follow the benchmarks, find the business model
Numbers first. The Gemma 4 31B dense model posts 85.2 percent on MMLU Pro, 84.3 percent on GPQA Diamond — a benchmark designed to require genuine domain expertise, not pattern matching — and a Codeforces Elo of 2150, up from 110 in Gemma 3. That last figure deserves its own paragraph. A jump from Elo 110 to 2150 means the model went from performing at the level of someone who just learned what a for-loop is to competing with the top one percent of competitive programmers on Earth. The 26B MoE, activating only 3.8 billion parameters per forward pass, scores 88.3 percent on AIME 2026 and occupies the sixth spot on the Arena leaderboard. Parameter count, the metric that defined the first wave of the AI arms race, is now officially a vanity number. What matters is intelligence per watt, intelligence per dollar, intelligence per gigabyte of RAM.
The edge models tell an equally important story. Google’s E2B and E4B variants are purpose-built for deployment on smartphones, tablets, and embedded devices. The E2B processes text, images, and audio natively while fitting inside memory envelopes that would have been unthinkable for a multimodal model twelve months ago. Google has already integrated these models into Android AICore, the system service that manages on-device model delivery and hardware acceleration, with support for specialized AI accelerators from Google, MediaTek, and Qualcomm. The consumer-facing implementation, Gemini Nano 4, built on the Gemma 4 framework, promises to be four times faster and use 60 percent less battery than its predecessor. On-device AI has been a marketing buzzword for three years. Gemma 4 is the first release that makes it a plausible production capability for third-party developers building on Android.
The strategic logic behind giving this away starts in Google Cloud’s balance sheet. Cloud revenue grew 48 percent year-over-year to $17.7 billion in Q4 2025, with an annual run rate exceeding $70 billion and a contracted backlog that exploded 55 percent quarter-over-quarter to $240 billion. Revenue from products built on Google’s generative AI models grew more than 200 percent year-over-year in the most recent quarter. Nearly 75 percent of Google Cloud customers have adopted the company’s vertically optimized AI, and those customers use 1.8 times as many products as non-AI customers. The flywheel is elegant: ship the best open model under the most permissive license, attract a community of developers who build on it, and monetize the subset that deploys at scale through Vertex AI, Cloud Run, and Google Kubernetes Engine. Every Gemma download is a top-of-funnel event for Google Cloud sales.
The licensing shift is particularly devastating for Meta’s competitive position. Llama 4, released earlier this year, remains under Meta’s community license, which imposes a 700-million monthly active user threshold above which commercial use requires a separate agreement. For startups and mid-market companies, that cap is irrelevant. For hyperscalers, telecommunications providers, and global consumer platforms — exactly the customers that generate the largest cloud infrastructure bills — it introduces legal uncertainty that Apache 2.0 eliminates entirely. Alibaba’s Qwen 3.5 already ships under Apache 2.0, and Chinese labs collectively hold most of the top positions on open-weight leaderboards. Google’s licensing move is less about generosity and more about ensuring it competes on the same legal footing as its most aggressive rivals in what has become a three-way race between Mountain View, Hangzhou, and Paris. The broader market for open-source AI models is projected to grow at a 15.1 percent CAGR through 2034, and roughly 66 percent of developers already use open-source models in production. Google is not giving away the product. Google is giving away the acquisition cost for its real product.
Here is the quantitative insight that emerges when you stitch these numbers together: Gemma 4’s 26B MoE achieves 88.3 percent on AIME 2026 while activating only 3.8 billion parameters — roughly 0.6 percent of the parameter footprint of a 600B+ frontier model. If cloud inference is priced by the number of active parameters processed per token, the 26B MoE delivers equivalent accuracy on mathematical reasoning at roughly 1/160th the cost of running a full 600-billion-parameter dense model. That ratio will compress as larger models optimize, but the order-of-magnitude advantage is structural, not incidental. MoE architectures with aggressive sparsity are rewriting the economics of AI inference.
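The ratio is easy to reproduce. A quick sketch, assuming per-token cost scales linearly with active parameters, which is a first-order simplification that ignores memory bandwidth, batching, and routing overhead:

```python
# Reproducing the active-parameter cost ratio from the paragraph above.
# Linear cost-per-active-parameter is an assumption, not a pricing model.

moe_active_params_b = 3.8   # Gemma 4 26B MoE, active per forward pass
dense_params_b = 600.0      # representative frontier dense model

active_fraction = moe_active_params_b / dense_params_b
cost_ratio = dense_params_b / moe_active_params_b

print(f"active fraction: {active_fraction:.1%}")  # ~0.6%
print(f"cost ratio:      ~{cost_ratio:.0f}x")     # ~158x, i.e. roughly 160x
```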
The speed tax and the Chinese wall
Before the champagne corks fly, two problems demand honest accounting. The first is speed. Within 24 hours of launch, developers running Gemma 4 locally reported inference speeds that ranged from disappointing to disqualifying. The 26B MoE model produced roughly 11 tokens per second on GPUs where Alibaba’s Qwen 3.5 generated 60 or more tokens per second — a five-to-one throughput deficit that no amount of benchmark glory can paper over. The dense 31B fared better at 18 to 25 tokens per second on dual GPUs, but for interactive applications where users expect sub-second first-token latency, that remains marginal at best.
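To translate those throughput numbers into what a user actually feels, consider the wall-clock time for a typical chat reply. The response length below is an illustrative assumption; the tokens-per-second figures are the community-reported ones above:

```python
# Wall-clock time for an interactive reply at the reported throughputs.
# 400 tokens is an assumed response length, not a measured figure.

response_tokens = 400

reported = [
    ("Gemma 4 26B MoE", 11),    # community-reported local throughput
    ("Gemma 4 31B dense", 22),  # midpoint of the 18-25 tok/s range
    ("Qwen 3.5", 60),           # lower bound of "60 or more"
]

for model, tps in reported:
    print(f"{model}: ~{response_tokens / tps:.0f}s per {response_tokens}-token reply")
```

At 11 tokens per second, a 400-token answer takes over half a minute; at 60, under seven seconds. That is the difference between a usable assistant and an abandoned tab.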
The root cause is architectural. Gemma 4 uses heterogeneous attention head dimensions that force popular inference engines like vLLM to disable FlashAttention and fall back to a slower Triton attention kernel. On the E4B edge model, this incompatibility produced throughput of just 9 tokens per second on an RTX 4090 — a card capable of pushing Llama 3.2 3B at over 100 tokens per second. The community has already begun patching: memory usage has been reduced by nearly 40 percent in context-heavy scenarios through third-party optimizations, and the 26B MoE can now fit within the 24 gigabytes of VRAM on a consumer RTX 3090 or 4090 with recent llama.cpp fixes. But the gap between Google’s optimized cloud deployment and the community’s local experience is wide enough to create real frustration among the independent developers who form the backbone of the Gemmaverse. Google ships benchmarks measured on its own infrastructure. Developers ship products on the hardware they can actually afford.
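The 24-gigabyte claim is worth sanity-checking, because MoE sparsity saves compute, not memory: all 26 billion weights must be resident even though only 3.8 billion are active per token. A rough sketch, with quantization levels and KV-cache sizing as illustrative assumptions:

```python
# Why the 26B MoE can now squeeze into a 24 GB consumer card.
# Bits-per-weight and KV-cache size are illustrative assumptions.

def fits_in_vram(total_params_b: float, bits_per_weight: float,
                 kv_cache_gb: float, vram_gb: float = 24.0) -> bool:
    weights_gb = total_params_b * 1e9 * bits_per_weight / 8 / 1e9
    needed = weights_gb + kv_cache_gb
    print(f"{bits_per_weight}-bit weights: {weights_gb:.1f} GB "
          f"+ {kv_cache_gb:.1f} GB KV cache = {needed:.1f} GB")
    return needed <= vram_gb

fits_in_vram(26, 16, 4.0)  # fp16: ~52 GB of weights -- hopeless on one card
fits_in_vram(26, 4, 4.0)   # 4-bit: ~13 GB of weights -- fits with headroom
```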
The second and more structural challenge is the competitive reality of the open-model landscape in April 2026. Chinese laboratories — Zhipu AI, Alibaba, Moonshot AI, and DeepSeek — hold most of the top positions among open-weight models on global leaderboards. Zhipu AI’s GLM-5 (Reasoning) leads BenchLM.ai’s overall rankings at 82, with Alibaba’s Qwen 3.5 397B at 77 and GLM-5 standard at 75. Gemma 4 31B broke into the top five on Arena AI, but on MMLU Pro — the benchmark most closely correlated with general enterprise utility — Qwen 3.5 27B edges it out at 86.1 percent versus 85.2 percent. On GPQA Diamond, the gap is similar: 85.5 percent for Qwen versus 84.3 percent for Gemma. Google wins decisively on math competition benchmarks and competitive programming, but those victories matter most to researchers and specialized agents, not to the enterprise procurement teams evaluating models for customer service, document analysis, and code generation at scale.
There is also a question of whether benchmarks translate to production value at all. The AI industry has spent three years optimizing for leaderboard positions that may not correlate with the messy, ambiguous, context-dependent tasks that constitute real work. A model that scores 89 percent on AIME 2026 can solve competition-grade calculus problems, but the median enterprise AI use case involves summarizing a 40-page contract, generating a SQL query from a natural language question, or drafting an email that doesn’t sound like it was written by a robot. Arena AI’s human-preference rankings offer the best proxy for real-world utility, and Gemma 4’s #3 position there is genuinely impressive. But the dense 31B’s 0.54-tokens-per-second throughput on some configurations renders it impractical for the interactive use cases where human preference matters most. Benchmarks measure what a model can do. Latency determines what a model will do.
The memory story adds another wrinkle. Gemma models have historically been VRAM-hungry for long-context workloads, and Gemma 4 continues that pattern. Running the 31B dense model with a full 256K context window requires hardware configurations that push well beyond what most independent developers and small companies can provision. Google’s answer — deploy on Vertex AI — is rational from a business perspective but undermines the open-source ethos that makes Gemma attractive in the first place. If the model is truly open but only performs well on the vendor’s cloud, the openness becomes a marketing gesture rather than a technical reality.
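The long-context appetite is mostly KV cache, which grows linearly with context length no matter how cleverly the weights are quantized. A rough sizing sketch; the layer and head dimensions are assumed for illustration, not Gemma 4’s published configuration:

```python
# Rough KV-cache sizing for long-context serving. Layer count, KV heads,
# and head dimension are assumed values, not Gemma 4's actual config.

def kv_cache_gb(ctx_tokens: int, layers: int = 48, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    # 2x for keys and values, per layer, per cached token
    return 2 * layers * kv_heads * head_dim * ctx_tokens * bytes_per_elem / 1e9

for ctx in (8_192, 32_768, 262_144):
    print(f"{ctx:>7} tokens: ~{kv_cache_gb(ctx):.1f} GB of KV cache")
```

Under those assumptions, a full 256K-token context demands roughly 50 gigabytes for the cache alone, before a single weight is loaded.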
The operator’s playbook for a post-parameter world
Despite the speed tax and the competitive pressure from Chinese labs, Gemma 4 marks a genuine inflection point. The era of “bigger model wins” is over. The era of “smarter architecture plus smarter licensing plus smarter distribution wins” has begun, and Google is better positioned to play that game than any other company on Earth. It has the research lab (DeepMind), the cloud platform (Google Cloud), the mobile operating system (Android with 3.3 billion active devices), the developer ecosystem (400 million Gemma downloads), and now the licensing posture (Apache 2.0) to convert open-model dominance into durable commercial advantage.
For the broader industry, Gemma 4 accelerates three trends that will define the next twelve months of AI deployment. First, on-device AI becomes real. Not “real” in the way that marketing teams have been claiming since 2023, but real in the sense that a developer can build a multimodal agent that processes text, images, and audio entirely on a user’s phone without a network connection, using a model that fits in 1.5 gigabytes of RAM and genuinely reasons about complex tasks. The E2B and E4B models, combined with Android AICore integration and hardware acceleration from MediaTek and Qualcomm, create a deployment path that did not exist six months ago. Privacy-sensitive applications in healthcare, finance, and government that could not send data to the cloud now have a viable alternative.
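The deployment pattern itself is simple enough to sketch. The following uses the llama-cpp-python bindings as a desktop stand-in for the on-device runtime; the GGUF filename is a hypothetical placeholder, and nothing here touches the network at inference time:

```python
# Minimal fully-offline inference loop in the style the E2B/E4B models
# enable. The model filename is a hypothetical placeholder -- substitute
# whatever quantized checkpoint you actually have on disk.

from llama_cpp import Llama

llm = Llama(
    model_path="gemma4-e2b-q4.gguf",  # hypothetical quantized checkpoint
    n_ctx=8192,                       # modest context to bound RAM usage
)

out = llm(
    "Summarize the key risks in this lease clause: ...",
    max_tokens=256,
)
print(out["choices"][0]["text"])
```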
Second, the licensing wars reshape enterprise procurement. With Gemma 4 on Apache 2.0, Qwen on Apache 2.0, and Mistral on Apache 2.0, the holdout is Meta’s Llama license. Expect pressure on Meta to follow suit or risk losing enterprise mindshare to competitors who have already removed the friction. For enterprises evaluating open models, the procurement question has shifted from “which model scores highest?” to “which model scores well enough while introducing zero legal risk and running efficiently on our existing infrastructure?” Gemma 4’s combination of competitive benchmarks, permissive licensing, and Google Cloud integration makes it the default answer for any organization already in Google’s ecosystem — and a serious contender for those that are not.
Third, the MoE efficiency revolution that Gemma 4 demonstrates will propagate across every layer of the stack. The 26B model achieving top-six Arena performance with 3.8 billion active parameters is a proof point that will accelerate investment in sparse architectures, speculative decoding, and hybrid attention mechanisms. If intelligence per active parameter continues to improve at this rate, the cost of running frontier-quality inference will drop by an order of magnitude within 18 months, collapsing the pricing advantage of closed API providers and making self-hosted open models economically rational for a much larger segment of the market. The cloud AI developer services market, projected to grow at 23.6 percent CAGR through 2030 to $55 billion, will increasingly be driven by organizations deploying open models on managed infrastructure rather than purchasing tokens from closed APIs.
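The mechanism behind the 3.8-billion-active figure is top-k expert routing: a small router scores every expert, but only a handful are evaluated per token. A toy illustration with made-up expert counts, not Gemma 4’s actual configuration:

```python
# Toy top-k MoE routing: why a model can hold 26B parameters but touch
# only a fraction per token. Expert count and k are made-up numbers.

import numpy as np

n_experts, k, d = 64, 4, 512
rng = np.random.default_rng(0)

router_w = rng.standard_normal((d, n_experts))
experts = rng.standard_normal((n_experts, d, d)) * 0.02

def moe_layer(x: np.ndarray) -> np.ndarray:
    logits = x @ router_w                  # score all experts (cheap)
    top = np.argsort(logits)[-k:]          # keep only the k best
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()               # softmax over the chosen k
    # Only k of n_experts weight matrices are ever read for this token.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

y = moe_layer(rng.standard_normal(d))
print(f"touched {k}/{n_experts} experts = {k/n_experts:.1%} of expert parameters")
```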
The operator checklist for anyone evaluating Gemma 4 this week looks like this:
- Benchmark against your actual workload, not against AIME. Gemma 4 excels at math, reasoning, and competitive programming. If your use case is document summarization or conversational AI, run your own evaluation suite before committing (a minimal harness sketch follows this list). Arena AI’s #3 ranking is encouraging, but your customers’ satisfaction is the only leaderboard that pays invoices.
- Budget for inference optimization. The out-of-the-box speed on consumer hardware is subpar. Plan for quantization tuning, vLLM patches, or Google Cloud deployment depending on your latency requirements. The model is capable. Making it fast enough requires engineering investment.
- Audit your licensing exposure today. If you are running Llama under Meta’s community license and your platform is approaching the 700-million-user threshold — or if your legal team simply wants to eliminate ambiguity — Gemma 4 under Apache 2.0 is the cleanest alternative available.
- Test the edge models for on-device use cases. The E2B and E4B models with native audio support represent capabilities that no other open model family offers at those sizes. If you are building for mobile, IoT, or air-gapped environments, these deserve a prototype sprint this quarter.
- Watch the Chinese labs. Qwen 3.5, GLM-5, and Kimi K2.5 are not second-tier competitors. On several benchmarks they outperform Gemma 4, and their iteration speed is accelerating. Any open-model strategy that ignores Hangzhou, Beijing, and Shenzhen is planning with one eye closed.
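For the first checklist item, the harness does not need to be elaborate. A minimal sketch against any OpenAI-compatible local server (vLLM, llama.cpp’s server, and Ollama all expose this API shape); the endpoint, model name, and grading rule are placeholders to adapt:

```python
# Minimal workload-specific eval harness. Endpoint, model name, and the
# substring-grading rule are placeholders; swap in your own test cases.

import json
import urllib.request

ENDPOINT = "http://localhost:8000/v1/chat/completions"  # your local server
MODEL = "gemma-4-31b"                                   # placeholder name

def ask(prompt: str) -> str:
    body = json.dumps({
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,
    }).encode()
    req = urllib.request.Request(ENDPOINT, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Replace with cases drawn from your real tickets, contracts, or queries.
cases = [
    {"prompt": "Extract the renewal date: 'This agreement renews on 2026-07-01.'",
     "expect": "2026-07-01"},
]

passed = sum(c["expect"] in ask(c["prompt"]) for c in cases)
print(f"{passed}/{len(cases)} passed")
```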
Google gave away its best open model. The strategy is not altruism. The strategy is that every developer who builds on Gemma 4 is one step closer to deploying on Google Cloud, one step closer to integrating with Android, and one step further from Meta’s ecosystem. In a market where 66 percent of developers already use open-source AI models and 75 percent plan to expand their usage, owning the default open model is worth more than any API margin. The parameter wars are over. The platform wars are just beginning.
In other news
PrismML emerges from stealth with 1-bit Bonsai LLMs — PrismML, founded by Caltech researchers with a $16.25 million seed round, released the first commercially viable 1-bit large language models. The Bonsai 8B model fits in just 1.15 GB of memory, runs at 44 tokens per second on an iPhone 17 Pro Max, and is 14 times smaller than its full-precision equivalent while remaining competitive on reasoning benchmarks (The Register).
Cursor 3 ships an agent-first workspace — Anysphere launched Cursor 3 with a fundamentally redesigned interface centered on AI agents rather than traditional code editing. The new workspace supports multi-repo layouts, seamless handoff between local and cloud agents, and integrates with Slack, GitHub, and Linear, signaling a shift from “AI-enhanced editor” to “human architect, agent builders” (SiliconANGLE).
OpenAI acquires TBPN talk show for hundreds of millions — OpenAI purchased TBPN, a daily live tech talk show hosted by founders John Coogan and Jordi Hays, in its first media acquisition. The show generated roughly $5 million in ad revenue in 2025 and is on track for $30 million this year; it will report to OpenAI’s chief political operative Chris Lehane while maintaining editorial independence (TechCrunch).
Alibaba drops Qwen 3.6-Plus with 1M-token context and agentic coding — Alibaba released Qwen 3.6-Plus, a hybrid linear-attention and sparse MoE model with a million-token context window that matches Claude Opus 4.5 on SWE-bench and Terminal-Bench 2.0. The model autonomously plans, tests, and iterates on code for production-ready solutions, available at $0.29 per million input tokens (Caixin Global).
Goldman Sachs projects 49% semiconductor revenue surge from AI demand — A Goldman Sachs report published April 5 forecasts that artificial intelligence adoption will drive global semiconductor revenues up 49 percent from current levels by end of 2026, with AI-specific chips representing the fastest-growing segment as hyperscalers and enterprises accelerate deployment across data centers and edge devices (ANI News).