Gemma 4: Model Variants & Specifications
Choose the right size for your workload, hardware, and deployment environment
One Architecture, Multiple Scales: The Gemma 4 family shares a unified transformer foundation with optimized routing, quantization pipelines, and deployment tooling. Whether you're building on-device assistants, enterprise RAG systems, or research-grade agents, there's a Gemma 4 variant engineered for your constraints and performance targets.
Available Model Variants
- Gemma 4 2B: Ultra-lightweight variant optimized for mobile, IoT, and browser-based inference.
- Gemma 4 9B: Balanced trade-off between capability and efficiency for desktop and cloud workloads.
- Gemma 4 27B: High-capability model for complex reasoning, coding, and enterprise-grade deployments.
- Gemma 4 270B (Preview): Mixture-of-experts variant delivering frontier-level capabilities for specialized and research use.
Quantization & Deployment Options
- FP16 / BF16: Full precision for maximum capability retention. Recommended for research & high-accuracy pipelines.
- INT8 Quantization: ~50% VRAM reduction with <1.5% capability drop. Ideal for cloud instances & mid-tier GPUs.
- INT4 Quantization: ~75% VRAM reduction with ~2.5% capability trade-off. Enables consumer GPU & edge deployment.
- MoE Routing (270B only): Activates only 2-3 experts per token, reducing active compute by ~60% while preserving capacity.
- Framework Support: Official weights available for PyTorch, JAX, GGUF, safetensors, and ONNX export.
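The VRAM savings quoted above follow directly from bits-per-parameter arithmetic. A minimal sketch (the function name and the 27B parameter count are illustrative, and the estimate covers weights only, not KV cache or activations):

```python
def estimate_weight_vram_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate VRAM needed just to hold the model weights.

    Ignores KV cache, activations, and framework overhead, which add more
    on top of this in any real deployment.
    """
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1024**3

# Rough weight-only footprints for an illustrative 27B-parameter model:
for bits in (16, 8, 4):
    print(f"{bits}-bit: ~{estimate_weight_vram_gb(27e9, bits):.1f} GB")
```

Halving the bits halves the weight footprint, which is where the ~50% (INT8) and ~75% (INT4) reductions come from.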
INT4 variants enable deployment on consumer hardware (RTX 3090/4090, Apple M-series), but complex reasoning & long-context tasks may benefit from INT8 or FP16 on data-center GPUs (A100/H100/MI300).
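The "~60% reduction in active compute" from MoE routing can be sanity-checked with a back-of-the-envelope model. This sketch assumes a hypothetical expert count and shared-parameter fraction, neither of which is a published Gemma 4 spec:

```python
def active_params(total_params: float, num_experts: int,
                  active_experts: int, shared_fraction: float) -> float:
    """Parameters touched per token in a sparse MoE transformer.

    shared_fraction: portion of parameters (attention, embeddings,
    shared layers) that every token uses regardless of routing; the
    rest is split evenly across experts, of which only a few fire.
    """
    shared = total_params * shared_fraction
    expert_pool = total_params - shared
    return shared + expert_pool * active_experts / num_experts

# Illustrative: 270B total, 16 experts, 2 active, 20% shared params.
print(active_params(270e9, 16, 2, 0.2) / 270e9)  # fraction of params active
```

With these assumed numbers, roughly 30% of parameters are active per token; the exact reduction depends on the true expert count and shared fraction.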
How to Choose the Right Model
Define Latency & Throughput Requirements
Real-time interactive apps → 2B/9B. Batch processing & complex reasoning → 27B/270B.
Assess Hardware Constraints
Mobile/Edge → 2B INT4. Desktop → 9B INT8/INT4. Cloud/Enterprise → 27B FP16/INT8. Research clusters → 270B MoE.
Match Task Complexity
Classification, summarization, light chat → 2B/9B. Coding, math, multi-step planning, RAG → 27B+. Frontier research → 270B.
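The selection heuristics above can be expressed as a simple lookup. A hypothetical helper, not an official sizing tool; adjust the categories and thresholds for your own workloads:

```python
def suggest_variant(hardware: str, task: str) -> str:
    """Map hardware and task category to a starting-point variant,
    following the decision steps in this guide."""
    if hardware in ("mobile", "edge", "browser"):
        return "2B (INT4)"
    if task in ("classification", "summarization", "chat"):
        return "9B (INT8)" if hardware == "desktop" else "9B"
    if task in ("coding", "math", "rag", "planning"):
        return "27B"
    return "270B (preview)"  # frontier research workloads

print(suggest_variant("cloud", "coding"))
```

Treat the result as a starting point: benchmark on your own data before committing to a size.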
Plan for Fine-Tuning
QLoRA works efficiently on 9B/27B. Full fine-tuning recommended only for 27B+ on multi-GPU setups.
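One reason QLoRA is practical on the 9B/27B sizes is that LoRA trains only small low-rank adapters rather than the full weights. A sketch of the adapter-size arithmetic (the layer shapes and rank below are illustrative, not actual Gemma 4 dimensions):

```python
def lora_trainable_params(layer_shapes, rank: int) -> int:
    """Count trainable parameters added by LoRA adapters.

    Each adapted weight W of shape (d_out, d_in) gains two low-rank
    factors: A of shape (rank, d_in) and B of shape (d_out, rank).
    """
    return sum(rank * (d_out + d_in) for d_out, d_in in layer_shapes)

# e.g. adapting q/v projections (4096 x 4096) in 32 layers at rank 16:
shapes = [(4096, 4096)] * (2 * 32)
print(lora_trainable_params(shapes, rank=16))
```

At these assumed dimensions the adapters total a few million parameters, orders of magnitude below the base model, which is why a single GPU can handle the optimizer state.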
Access Models & Integration Tools
Download weights, run inference, or integrate via the API.
⚠️ Licensing & Usage Notice
Gemma 4 weights are released under a permissive open-weight license allowing commercial use, modification, and distribution. Usage must comply with Google's AI Principles and acceptable use policy. The 270B variant is currently in limited preview and requires approval for commercial deployment.