Gemma 4 Download Hub

Download
Gemma 4
Open Models

The most capable open model family, built from Gemini 3 research. Frontier intelligence that is free, Apache 2.0 licensed, and runs on your own hardware.

4 sizes
E2B · E4B · 26B · 31B
256K
Max context window
400M+
Total Gemma downloads
140+
Supported languages
Model Variants
Choose your size
Gemma 4 E2B
Dense · Edge Series
On-Device 2B Params
Context
128K
Min RAM
4 GB
VRAM
2 GB
Vision
✓ Yes
Audio
✓ Yes
Thinking
✓ Yes
Perfect for phones, Raspberry Pi, and offline apps. Near-zero latency on-device. Runs fully without a network connection.
Gemma 4 E4B
Dense · Edge Series
Edge Server 4B Params
Context
128K
Min RAM
8 GB
VRAM
4 GB
Vision
✓ Yes
Audio
✓ Yes
Thinking
✓ Yes
Sweet spot for developer laptops and Jetson Orin. Strong multimodal performance including audio. Ideal for local prototyping.
Gemma 4 31B
Dense · Server Grade
#3 Global 31B Params
Context
256K
Min RAM
64 GB
VRAM
24 GB
Vision
✓ Yes
Audio
— No
Thinking
✓ Yes
Ranked #3 globally on Arena.ai. Maximum quality for enterprise workloads, complex reasoning, and 256K-context document processing on A100/H100/TPU.
Platforms & Runtimes
10+ supported
🤗
Hugging Face
Official model weights in Safetensors, GGUF & PyTorch. Direct integration with Transformers, TRL, and Candle.
pip install transformers accelerate
🦙
Ollama
One-line install for Mac, Linux & Windows. Handles quantization automatically. Best for local prototyping.
$ ollama run gemma4:26b
🖥️
LM Studio
GUI desktop app. No coding required. Supports all Gemma 4 sizes with built-in quantization selector.
# Search "google/gemma-4" in model browser
📊
Kaggle
Official Google distribution with free GPU notebooks. Great for experiments and prototyping without local setup.
$ kaggle models instances versions download google/gemma
🚀
vLLM
High-throughput production inference. Best for serving 26B / 31B models at scale with OpenAI-compatible API.
$ vllm serve google/gemma-4-26b-it --dtype auto
⚙️
llama.cpp
CPU-first quantized inference. Runs Gemma 4 GGUF models on any hardware. No GPU required for small models.
$ ./llama-cli -m gemma-4-e4b-it-q4_k_m.gguf -i
🍎
MLX (Apple Silicon)
Optimised for M1/M2/M3/M4 chips. Best performance on macOS with unified memory. Supports 4-bit quantization.
$ mlx_lm.generate --model mlx-community/gemma-4-e4b-it-4bit
💚
NVIDIA NIM
Containerised deployment for Jetson and data centre GPUs. Production-grade with health checks and monitoring.
$ docker run --gpus all nvcr.io/nim/google/gemma-4-26b-it
☁️
Vertex AI
Google Cloud managed serving on TPUs and GPUs. Enterprise SLAs, VPC, and compliance guarantees.
$ gcloud ai models deploy gemma4-26b --region=us-central1
Quickstart Code
Copy & run
Shell · Ollama
# Install Ollama (Mac / Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run Gemma 4 — choose your size
ollama run gemma4:e2b    # 2B  — phones, Pi
ollama run gemma4:e4b    # 4B  — laptop
ollama run gemma4:26b    # 26B MoE — workstation ★ recommended
ollama run gemma4:31b    # 31B dense — server / A100

# REST API (once running)
curl http://localhost:11434/api/chat \
  -d '{"model":"gemma4","messages":[{"role":"user","content":"Hello!"}]}'
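The same `/api/chat` call can be made from Python with only the standard library. A minimal sketch, assuming Ollama's default port 11434 and the request shape shown in the curl example above; `build_chat_payload` and `chat` are hypothetical helper names.

```python
import json
import urllib.request

def build_chat_payload(model: str, prompt: str) -> dict:
    """Build the JSON body for Ollama's /api/chat endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # one JSON response instead of a chunk stream
    }

def chat(prompt: str, model: str = "gemma4") -> str:
    """POST a single-turn chat to a locally running Ollama server."""
    req = urllib.request.Request(
        "http://localhost:11434/api/chat",
        data=json.dumps(build_chat_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]

# print(chat("Hello!"))  # requires `ollama run gemma4` in another terminal
```

Setting `"stream": False` trades incremental output for a single parseable JSON object, which keeps the client a few lines long.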
Python · Hugging Face Transformers
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

# Load model — swap model_id for your chosen size
model_id = "google/gemma-4-26b-it"   # or e2b-it / e4b-it / 31b-it

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{
    "role": "user",
    "content": [{"type": "text", "text": "Explain quantum entanglement."}]
}]

inputs = processor.apply_chat_template(
    messages, tokenize=True,
    return_dict=True, return_tensors="pt",
    add_generation_prompt=True,
    enable_thinking=True,           # ← thinking mode!
).to(model.device)

output = model.generate(**inputs, max_new_tokens=2048)
print(processor.decode(output[0], skip_special_tokens=True))
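Since the edge and 31B variants list vision support, the messages list above can also carry an image part. A sketch of the multimodal input shape, following the Transformers chat-template convention for image-text-to-text models; the image URL and `build_vision_messages` helper are illustrative assumptions.

```python
def build_vision_messages(image_url: str, question: str) -> list[dict]:
    """One user turn containing an image part followed by a text part."""
    return [{
        "role": "user",
        "content": [
            {"type": "image", "url": image_url},
            {"type": "text", "text": question},
        ],
    }]

messages = build_vision_messages(
    "https://example.com/cat.png",   # placeholder URL
    "What is in this picture?",
)

# With the processor/model loaded as in the snippet above:
# inputs = processor.apply_chat_template(
#     messages, tokenize=True, return_dict=True,
#     return_tensors="pt", add_generation_prompt=True,
# ).to(model.device)
# output = model.generate(**inputs, max_new_tokens=512)
```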
Python · Gemini API (Google AI Studio)
# pip install google-genai
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

response = client.models.generate_content(
    model="gemma-4-31b-it",       # or gemma-4-26b-it
    contents="Write a haiku about open-source AI.",
)
print(response.text)

# ── REST equivalent ─────────────────────────────────
# curl "https://generativelanguage.googleapis.com/v1beta/
#   models/gemma-4-31b-it:generateContent?key=YOUR_KEY" \
#   -H 'Content-Type: application/json' -X POST \
#   -d '{"contents":[{"parts":[{"text":"Hello!"}]}]}'
Shell + Python · vLLM Production Server
# Install vLLM
pip install vllm

# Start OpenAI-compatible server
vllm serve google/gemma-4-26b-it \
  --dtype auto \
  --max-model-len 65536 \
  --tensor-parallel-size 2    # for multi-GPU

# Query via OpenAI SDK
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="none",
)

resp = client.chat.completions.create(
    model="google/gemma-4-26b-it",
    messages=[{"role":"user","content":"Hello Gemma!"}],
)
print(resp.choices[0].message.content)
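For interactive use, the same OpenAI-compatible server supports streaming: pass `stream=True` and accumulate the per-chunk deltas. A sketch assuming the server started above; `collect_stream` is a hypothetical helper that joins the delta fragments into one string.

```python
def collect_stream(chunks) -> str:
    """Join the content deltas of an OpenAI-style chat-completion stream."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta.content
        if delta:  # role-only or empty chunks carry no content
            parts.append(delta)
    return "".join(parts)

# Against the live server started above:
# from openai import OpenAI
# client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
# stream = client.chat.completions.create(
#     model="google/gemma-4-26b-it",
#     messages=[{"role": "user", "content": "Hello Gemma!"}],
#     stream=True,
# )
# print(collect_stream(stream))
```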
Shell · MLX (Apple Silicon M1–M4)
# Install MLX LM
pip install mlx-lm

# Run inference (4-bit quantized)
mlx_lm.generate \
  --model mlx-community/gemma-4-e4b-it-4bit \
  --prompt "Explain transformers simply." \
  --max-tokens 512

# Or 26B at 4-bit for M2 Max / M3 Max (≥64 GB)
mlx_lm.generate \
  --model mlx-community/gemma-4-26b-it-4bit \
  --prompt "Write a Python quicksort." \
  --max-tokens 1024

# Python API
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/gemma-4-e4b-it-4bit")
response = generate(model, tokenizer, prompt="Hello!", max_tokens=256)
print(response)
Shell · Docker / NVIDIA NIM
# NVIDIA NIM — containerised, production-ready
docker run --gpus all \
  -p 8000:8000 \
  nvcr.io/nim/google/gemma-4-26b-it:latest

# Unsloth Fine-tune Studio (Mac / Linux / WSL)
curl -fsSL https://unsloth.ai/install.sh | sh
unsloth studio -H 0.0.0.0 -p 8888

# Dockerfile snippet for custom build
FROM nvidia/cuda:12.4.0-runtime-ubuntu22.04
RUN pip install vllm transformers
COPY serve.py /app/serve.py
CMD ["python", "/app/serve.py"]
Benchmark Performance
vs leading models
Model                 | MMLU | HumanEval | MATH | GPQA | Arena.ai Rank
----------------------|------|-----------|------|------|--------------
Gemma 4 31B ★ Open    | 88.2 | 91.4      | 85.0 | 62.1 | #3 Global
Gemma 4 26B A4B MoE   | 85.3 | 88.0      | 82.1 | 58.4 | Top 5
GPT-4o                | 87.2 | 90.2      | 76.6 | 53.6 | —
Llama 3.3 70B         | 83.6 | 80.5      | 77.0 | 50.7 | —
Gemma 4 E4B Edge      | 72.1 | 70.3      | 68.0 | 42.0 | Best Edge