律动BlockBeats|Jun 11, 2026 00:27
[Google Open-Sources Text Diffusion Model DiffusionGemma: Over 1,000 Tokens Per Second on a Single GPU, 4x Speed Boost]
According to monitoring by Beating, Google has released DiffusionGemma, an experimental open-source large language model that generates text by diffusion rather than sequentially, token by token, as traditional large language models do. DiffusionGemma has 26B total parameters and uses a Mixture of Experts (MoE) architecture that activates only 3.8B parameters per forward pass. By generating entire blocks of text in parallel, it achieves up to a 4x speedup in local GPU inference. Unlike traditional "typewriter-style" generation, DiffusionGemma works much like image generation: it first fills a canvas with random placeholders, then iteratively removes noise over multiple time steps, locking in the correct text as it goes. Each forward pass can generate 256 tokens in parallel, with bidirectional attention across all of them. Bidirectional attention offers significant advantages in nonlinear generation tasks such as code completion, inline editing, and mathematical formula generation. However, DiffusionGemma's overall output quality currently remains below that of the standard Gemma 4.
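The "placeholders first, then lock in text over several steps" idea can be illustrated with a toy sketch. This is not DiffusionGemma's actual algorithm or API: `toy_denoiser`, `diffusion_decode`, the `<mask>` placeholder, and the confidence-based locking schedule are all hypothetical, and the "model" here simply knows the target text and attaches a random confidence to each slot.

```python
import random

MASK = "<mask>"  # hypothetical placeholder token, not taken from the article


def toy_denoiser(canvas, target, rng):
    """Hypothetical stand-in for the model.

    Proposes a token and a confidence score for every masked slot at once,
    which mimics one parallel forward pass over the whole block.
    """
    return {
        i: (target[i], rng.random())
        for i, tok in enumerate(canvas)
        if tok == MASK
    }


def diffusion_decode(target, steps=4, seed=0):
    """Fill a fully masked block over a fixed number of denoising steps."""
    rng = random.Random(seed)
    canvas = [MASK] * len(target)
    history = [list(canvas)]
    for step in range(steps):
        proposals = toy_denoiser(canvas, target, rng)
        if not proposals:
            break
        # Spread the remaining slots evenly over the remaining steps
        # (ceiling division), so the final step always finishes the block.
        k = -(-len(proposals) // (steps - step))
        # Lock in the k most confident proposals; the rest stay masked.
        best = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)[:k]
        for i in best:
            canvas[i] = proposals[i][0]
        history.append(list(canvas))
    return canvas, history


if __name__ == "__main__":
    tokens = "def add(a, b): return a + b".split()
    final, history = diffusion_decode(tokens, steps=4)
    for snapshot in history:
        print(" ".join(snapshot))
```

Each printed line is one snapshot of the canvas. Tokens are filled in out of order, which is the property the article credits for the model's strength in code completion and inline editing.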
In hardware tests, a single NVIDIA H100 GPU generates over 1,000 tokens per second, while a consumer-grade NVIDIA GeForce RTX 5090 GPU exceeds 700 tokens per second. With 4-bit floating-point (NVFP4) quantization, inference memory usage drops below 18GB, significantly lowering the barrier to local deployment. The DiffusionGemma weights have been released on Hugging Face and are supported by mainstream development tools such as MLX, vLLM, Unsloth, and NVIDIA NeMo.
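A rough back-of-the-envelope check of the memory figure, as a sketch under stated assumptions: it counts only the raw weight storage for 26B parameters and ignores quantization scale factors, activations, and any runtime buffers. The 16-bit baseline is added here for comparison and does not come from the article.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Raw weight storage in gigabytes (1 GB = 1e9 bytes), weights only."""
    return num_params * bits_per_param / 8 / 1e9


TOTAL_PARAMS = 26e9  # total parameter count reported in the article

fp4_gb = weight_memory_gb(TOTAL_PARAMS, 4)    # 4-bit weights
bf16_gb = weight_memory_gb(TOTAL_PARAMS, 16)  # assumed 16-bit baseline

if __name__ == "__main__":
    print(f"4-bit weights:  {fp4_gb:.1f} GB")
    print(f"16-bit weights: {bf16_gb:.1f} GB")
```

Under these assumptions the 4-bit weights alone take about 13 GB, which is consistent with the reported total of under 18GB once overhead is added on top.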