skip to content
Stephen Van Tran
Table of Contents

The one-token-at-a-time orthodoxy just cracked

Every large language model you have ever used writes the way a medieval scribe did: one symbol at a time, left to right, each stroke waiting on the last. On June 10, Google shipped the first serious open-weights challenge to that orthodoxy. DiffusionGemma, an experimental 26-billion-parameter model released under Apache 2.0, abandons sequential decoding entirely. It drafts a 256-token block as noise and then resolves the whole canvas at once, the way ink blooms through water and settles into form — delivering up to 4x faster generation and crossing 1,000 tokens per second on a single NVIDIA H100.

The mechanism matters more than the demo. Autoregressive models — every GPT, every Claude, every standard Gemma — predict the next token, append it, and run the entire network again, per SiliconANGLE’s coverage of the release. DiffusionGemma instead generates placeholder text and iteratively replaces subsets of tokens with contextually appropriate words until the block converges. “DiffusionGemma changes this by shifting how models use hardware,” Google researchers Brendan O’Donoghue and Sebastian Flennerhag wrote — and that sentence is the thesis of the entire release. The bottleneck moves from memory bandwidth to raw compute, and GPUs have far more spare compute than spare bandwidth.

The stakes are larger than one experimental model. Text diffusion has lived in the research wing since Inception Labs shipped Mercury, the first commercial-scale diffusion LLM, in February 2025, and Google teased Gemini Diffusion at I/O that May before going conspicuously quiet. What changed on June 10 is distribution: the weights now sit on Hugging Face under a permissive license, quantized to fit in 18GB of VRAM, runnable on a consumer RTX 5090 at 700+ tokens per second. A paradigm that was a startup’s proprietary bet and a demo behind Google’s glass is now a thing any developer can fine-tune on a gaming PC.

The conceptual lineage is borrowed from images, and the borrowing is the point. Diffusion is the technique behind Stable Diffusion, Midjourney, and every modern image generator: start from pure noise, then iteratively denoise toward a coherent picture, refining the whole composition at every step rather than painting pixel by pixel. Applying it to text required solving a genuinely hard problem — words are discrete symbols, not continuous pixel values — which is why autoregression has monopolized language for eight years while diffusion conquered every other modality. DiffusionGemma’s “Uniform State Diffusion” mechanism is Google’s answer to that discreteness problem, and shipping it as open weights amounts to publishing the answer key.

And the timing is pointed. The frontier labs are racing upmarket — Anthropic just shipped Fable 5 at $50 per million output tokens behind safety classifiers, and Google’s own Gemini 3.5 Flash chases agentic benchmarks at cloud scale. DiffusionGemma runs the opposite direction: free, local, fast, and deliberately imperfect. Google states plainly that its output quality trails standard Gemma 4. The company is not claiming a better model. It is claiming a different physics — and open-sourcing the evidence so the ecosystem can decide whether the trade is worth it.

Ink in water: how a 256-token canvas beats the memory wall

Start with the architecture, because it is a clever act of recycling. DiffusionGemma is built directly on the Gemma 4 26B-A4B backbone — a mixture-of-experts design with 26 billion total parameters and only 3.8 billion active per inference step, per Google’s model documentation. It accepts text, image, and video inputs, carries a 256K-token context window, and covers 140+ languages. Rather than train a diffusion model from scratch, Google converted an autoregressive one: the prompt is processed with standard causal attention, then generation switches to bidirectional attention over a 256-token “canvas” that the model denoises in parallel, per MarkTechPost’s technical breakdown.

The speed numbers hold up across independent writeups. Google’s developer guide reports 1,000+ tokens per second on a single H100 and 700+ on a GeForce RTX 5090, with the up-to-4x speedup over standard Gemma 4 attributed to “compute-bound parallel generation” that sidesteps the memory-bandwidth limit governing sequential decoding. The New Stack’s analysis confirms the 4x figure against Google’s own Gemma family. NVIDIA, never shy about a hardware-flattering result, shipped day-one acceleration through its RTX AI Garage program, alongside vLLM-native serving, MLX and Transformers support, Unsloth and NeMo fine-tuning paths, and deployment through Google Cloud’s Model Garden and NVIDIA NIM.

Why does moving the bottleneck matter so much? Because sequential decoding wastes the hardware it runs on. Generating one token requires streaming the model’s active weights from GPU memory through the compute units — and then doing it again for the next token, and the next. For a single user, the GPU’s arithmetic units sit mostly idle, throttled by how fast memory can feed them; industry estimates put typical utilization in single-user decoding in the low single digits. DiffusionGemma’s block-parallel denoising flips that ratio: each forward pass does 256 tokens’ worth of useful arithmetic on the same memory traffic, as Google’s developer guide explains. The 4x speedup is not a clever optimization. It is reclaimed waste.

The deployment story is unusually complete for an “experimental” release. The quantized model uses NVIDIA’s NVFP4 lightweight data format to squeeze into 18GB of VRAM, per SiliconANGLE — under the memory ceiling of a single consumer flagship card, no server required. Google worked directly with the vLLM team on native integration, meaning DiffusionGemma serves through the same OpenAI-compatible local endpoint developers already script against; swapping it into an existing pipeline is a one-line URL change. That is not how research artifacts usually ship. It is how platform plays ship.

The bidirectional canvas buys qualitative abilities sequential models structurally lack. Because every token in the block attends to every other token — including ones “later” in the text — the model can correct errors mid-generation and build non-linear structures like code infills, tables, and mathematical layouts from the inside out. The most striking evidence is a parlor trick with teeth: fine-tuned on Sudoku, DiffusionGemma went from 0% to 80% solve rates, per MarkTechPost — a task that is nearly impossible left-to-right because constraints propagate in every direction at once. Sudoku is not a market. Constraint-bound generation — schemas, forms, structured code — very much is.

Context makes the throughput legible. Here is where diffusion-based generation now stands across the small field that exists:

ModelHardwareTokens/sec
DiffusionGemma 26BH1001,000+
DiffusionGemma 26BRTX 5090700+
Mercury 2Blackwell1,009
Mercury (2025)H1001,000+

Inception Labs’ original Mercury hit 1,000+ tokens per second on an H100 and tied for second on Copilot Arena while outpacing speed-optimized incumbents, per its arXiv technical report. Mercury 2, the first diffusion reasoning model, answers end-to-end in 1.7 seconds where Gemini 3 Flash takes 14.4 and Claude Haiku 4.5 takes 23.4 with reasoning enabled.

Stitch those rows together and a proprietary observation falls out: a $2,000 consumer RTX 5090 running DiffusionGemma delivers roughly 70% of the throughput Mercury 2 achieves on data-center Blackwell silicon — frontier-adjacent diffusion speed at perhaps a twentieth of the hardware cost. And the headroom is the real story. The 256-token canvas finalizes only about 15–20 tokens per forward pass, per MarkTechPost’s analysis — roughly 6–8% of the theoretical one-pass ceiling. Today’s 4x speedup uses less than a tenth of the parallelism the architecture permits. Better denoising schedules alone, no new hardware, could plausibly push that 4x toward 10x. Autoregression has no comparable lever left to pull; it is already bandwidth-bound.

That asymmetry explains why Google released this as open weights rather than a product. Inception’s Mercury launch proved a startup could commercialize the paradigm; what diffusion lacked was an ecosystem — fine-tuning recipes, quantization formats, serving infrastructure, a community finding the use cases. Apache 2.0 weights on consumer hardware recruit exactly that. Google is crowdsourcing the search for diffusion’s killer application while keeping its frontier autoregressive franchise untouched.

Where the ink refuses to settle

The most honest sentence in the launch is Google’s own caveat: “DiffusionGemma’s overall output quality is lower than standard Gemma 4,” with the company recommending its conventional model for maximum-quality applications, per the announcement. That is not a footnote; it is the whole trade. Speed without quality is a solved problem — small autoregressive models are fast too. Diffusion’s bet is that parallel generation can close the quality gap while keeping the speed edge, and DiffusionGemma is evidence the gap still exists, published by the party most motivated to deny it.

The speedup also comes with an asterisk that most coverage buried. Google’s developer guide notes the gains accrue primarily to local, low-concurrency, single-user inference — the regime where autoregressive decoding starves on memory bandwidth. Cloud providers already escape that regime by batching hundreds of requests until their GPUs are compute-bound anyway, which is why hyperscale serving sees diminishing returns from diffusion. The paradigm shift, at least in this iteration, is a local-AI story. That is a real market — but it is not the market where the inference dollars currently live.

History counsels skepticism about diffusion moments, too. Google demoed Gemini Diffusion at I/O 2025 to genuine excitement, then said essentially nothing about it for thirteen months — a silence that usually signals unresolved quality problems, not stealth progress. Inception Labs has shipped commercial diffusion models for sixteen months, posted arena-competitive coding results, and still occupies a niche; no major coding assistant has switched. If raw generation speed were a decisive purchasing criterion, Mercury’s 10x latency advantage would have moved more share than it has. Developers, it turns out, will wait twenty seconds for an answer that is right.

Notice, too, what the launch materials do not contain: a standard benchmark table. Google published throughput numbers, hardware footprints, and a Sudoku anecdote — but no MMLU, no GPQA, no coding-eval comparison against Gemma 4 or against Mercury. For a company that ships exhaustive eval cards with every Gemini release, the omission is informative. “Lower quality than standard Gemma 4” is doing an unquantified amount of work in that sentence, and until independent evaluations land on the open weights, nobody outside Google knows whether the gap is two points or twenty. The open release means we will find out quickly. It also means Google did not want to be the one to say.

There are structural objections as well. Diffusion models cannot stream tokens incrementally — the user sees nothing until the block resolves — which breaks the perceived-latency tricks every chat product relies on. The ~15–20 tokens finalized per pass means a 256-token block still requires 13–17 full forward passes, so the “parallel generation” framing flatters what is really a better amortization schedule. And the experimental label cuts both ways: Apache 2.0 weights with no production support contract is how Google hedges, shipping curiosity-grade artifacts while the frontier race that pays its bills stays firmly autoregressive. Anyone declaring the scribe dead should note that the scribe’s employer just licensed Siri for a billion dollars a year.

The competitive counterread is the sharpest: DiffusionGemma may be less a paradigm bet than a flank attack. Open, fast, local models erode the margins of mid-tier API serving — the territory where Microsoft is positioning its in-house MAI family and where every startup selling cheap inference lives. Google loses nothing by commoditizing that layer; its moat is Gemini at the frontier and TPUs underneath. If diffusion works, Google owns the seminal open release. If it stalls, Google spent one experimental model proving it. Heads I win, tails you lose is a strategy, not a science program — and it is worth remembering which company is playing it.

The denoising curve is the number to watch

Forget the 4x headline; track the acceptance rate. DiffusionGemma’s entire economic argument compresses into one ratio — tokens finalized per forward pass — currently sitting at 15–20 out of a 256-token canvas. Every improvement to that number is pure speed, free of new silicon, and the research lever (better noise schedules, smarter confidence thresholds, learned denoising orders) is exactly the kind of problem an open-weights community iterates on fastest. If community forks push acceptance toward 50 tokens per pass within two quarters, diffusion’s speed advantage stops being a demo statistic and starts being a deployment decision. If it stalls at 20, this release joins Gemini Diffusion in the drawer of intriguing silences.

The second signal is who fine-tunes it, and for what. The Sudoku result — 0% to 80% with task-specific tuning — sketches the playbook: diffusion models excel where output must satisfy bidirectional constraints, like schema-bound JSON, infilled code, structured documents, and form generation. Watch Hugging Face for derivative checkpoints in those niches. A handful of high-download structured-generation fine-tunes would validate the architecture’s differentiated value far more convincingly than throughput charts. Watch, too, whether Inception Labs treats this as validation or invasion; Mercury 2’s reasoning-mode results suggest the startup is one generation ahead, and an Apache 2.0 competitor from a hyperscaler is the classic kill-shot to a thin-moat startup.

There is also a hardware-economics subplot worth pricing. If diffusion’s compute-bound profile holds, it changes what an “inference GPU” needs to be: less exotic high-bandwidth memory, more raw arithmetic — a mix that favors cheaper silicon and, conveniently, the consumer RTX cards NVIDIA already manufactures at scale. NVIDIA’s day-one RTX AI Garage push reads as recognition of exactly that: a paradigm that makes gaming GPUs first-class inference hardware expands NVIDIA’s addressable market without cannibalizing data-center margins. Watch whether the next consumer GPU generation markets tokens-per-second the way this one markets frame rates.

The third signal is whether anyone ships product on it. The local-AI niche DiffusionGemma targets — 18GB VRAM, consumer GPUs, no API dependency — overlaps precisely with the privacy-sensitive and latency-sensitive workloads enterprises keep on-premises. An IDE vendor shipping local diffusion-powered autocomplete, or an OEM bundling it into an AI PC image, would mark the crossing from research artifact to infrastructure. Until then, the honest classification is: the most credible open experiment yet in a paradigm that has spent sixteen months being impressive and unadopted.

A final, quieter implication: every paradigm that makes capable models cheaper to run locally shifts a sliver of leverage from API vendors to their customers. The 256K context window and multimodal input mean DiffusionGemma is not a toy in scope, merely in polish — and polish is precisely what open communities supply. Eight years of autoregressive monoculture concentrated inference economics in the hands of whoever owned the biggest memory-bandwidth budget. A viable second paradigm, even a partial one, reprices that concentration. That alone justifies the experiment’s existence, whatever its benchmark scores turn out to be.

For operators, the checklist is short and concrete:

  • Benchmark it on structured-output workloads first. Diffusion’s edge is bidirectional constraint satisfaction, not prose. If you generate schema-bound JSON, code infills, or templated documents, run head-to-head evals against your current model; that is where the 4x speed costs the least quality.
  • Price the local-inference math before renewing API contracts. At 700+ tokens per second on a $2,000 RTX 5090, a single workstation delivers throughput that priced like data-center hardware a year ago. For high-volume, quality-tolerant pipelines (classification, drafting, extraction), the per-token cost approaches electricity.
  • Do not migrate chat or agentic workloads. No streaming, lower output quality than Gemma 4, and diminishing gains under batching make diffusion the wrong tool for user-facing conversation and multi-step agents — the places most teams spend most of their tokens.
  • Watch the acceptance-rate literature, not the press releases. Tokens-per-forward-pass is the single metric that decides whether diffusion’s 4x becomes 10x or stays a curiosity. Track community forks and the vLLM serving benchmarks.
  • If you build inference infrastructure, hedge now. Diffusion serving inverts autoregressive assumptions — no KV-cache streaming, block-level scheduling, bidirectional attention kernels. The vLLM integration is the reference implementation; understanding it is cheap insurance against the paradigm actually working.

The scribe is not dead. But for the first time, the printing press is sitting on a public bench with the license that lets anyone improve it — and the most interesting number in AI this week is not a benchmark score. It is how much of a 256-token page resolves with each pass of the ink.

In other news

OpenAI routes Codex through Oracle’s cloud — OpenAI announced that enterprises will be able to access its frontier models and Codex directly through Oracle Cloud Infrastructure, applying existing Oracle spending commitments toward OpenAI usage with availability expected in the coming weeks. The deal lowers procurement friction for Oracle’s enterprise base and deepens an already sprawling OpenAI–Oracle relationship (StartupHub.ai).

Moonshot AI chases a $30 billion valuation — China’s Moonshot AI, maker of the Kimi chatbot, is seeking up to $2 billion in its third funding round in six months at a $30 billion valuation — a sevenfold jump from the $4 billion it commanded in December, fueled by ARR that doubled to $200 million in roughly six weeks (The Next Web).

MIT finds AI assistance erodes misinformation defenses — An MIT Media Lab study tracking 67 participants over four weeks found AI chatbots boosted misinformation detection by 21% while in use, but participants’ unassisted accuracy fell 15.3 percentage points below baseline once the AI was removed — a “dependency paradox” the authors compare to GPS degrading navigation skills (MIT News).

Google rents SpaceX’s orbital compute for $920 million a month — Google will pay SpaceX roughly $920 million per month from October 2026 through June 2029 for access to approximately 110,000 NVIDIA GPUs plus supporting hardware, a multi-year commitment that lands days before SpaceX’s record-setting IPO prices (TechCrunch).

SoftBank commits up to €75 billion to French AI data centers — SoftBank will invest as much as €75 billion ($87 billion) to build 5 gigawatts of AI data center capacity in France, starting with a €45 billion first phase delivering 3.1 GW in the Hauts-de-France region by 2031 in partnership with EDF and Schneider Electric (TechCrunch).