The most capable open model family, built from Gemini 3 research. Frontier intelligence — free, Apache 2.0 licensed, and running on your own hardware.
# Install Ollama (Mac / Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run Gemma 4 — choose your size
ollama run gemma4:e2b   # 2B — phones, Pi
ollama run gemma4:e4b   # 4B — laptop
ollama run gemma4:26b   # 26B MoE — workstation ★ recommended
ollama run gemma4:31b   # 31B dense — server / A100

# REST API (once running)
curl http://localhost:11434/api/chat \
  -d '{"model":"gemma4","messages":[{"role":"user","content":"Hello!"}]}'
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

# Pick a checkpoint — swap model_id for your chosen size.
model_id = "google/gemma-4-26b-it"  # or e2b-it / e4b-it / 31b-it

# Processor handles chat templating + tokenization; model is sharded
# across available devices in bfloat16.
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

chat = [{
    "role": "user",
    "content": [{"type": "text", "text": "Explain quantum entanglement."}],
}]

# Build model-ready tensors from the chat turn.
model_inputs = processor.apply_chat_template(
    chat,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
    enable_thinking=True,  # ← thinking mode!
).to(model.device)

generated_ids = model.generate(**model_inputs, max_new_tokens=2048)
print(processor.decode(generated_ids[0], skip_special_tokens=True))
# pip install google-genai
from google import genai

# One client, one call — the hosted API mirrors the local snippets.
client = genai.Client(api_key="YOUR_API_KEY")

response = client.models.generate_content(
    model="gemma-4-31b-it",  # or gemma-4-26b-it
    contents="Write a haiku about open-source AI.",
)
print(response.text)

# ── REST equivalent ─────────────────────────────────
#   curl "https://generativelanguage.googleapis.com/v1beta/
#     models/gemma-4-31b-it:generateContent?key=YOUR_KEY" \
#     -H 'Content-Type: application/json' -X POST \
#     -d '{"contents":[{"parts":[{"text":"Hello!"}]}]}'
# Install vLLM:
#   pip install vllm
#
# Start the OpenAI-compatible server:
#   vllm serve google/gemma-4-26b-it \
#     --dtype auto \
#     --max-model-len 65536 \
#     --tensor-parallel-size 2   # for multi-GPU

# Query via the OpenAI SDK — any OpenAI-compatible client works.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="none",
)

resp = client.chat.completions.create(
    model="google/gemma-4-26b-it",
    messages=[{"role": "user", "content": "Hello Gemma!"}],
)
print(resp.choices[0].message.content)
# Install MLX LM:
#   pip install mlx-lm
#
# CLI inference (4-bit quantized):
#   mlx_lm.generate \
#     --model mlx-community/gemma-4-e4b-it-4bit \
#     --prompt "Explain transformers simply." \
#     --max-tokens 512
#
# Or 26B at 4-bit for M2 Max / M3 Max (≥64 GB):
#   mlx_lm.generate \
#     --model mlx-community/gemma-4-26b-it-4bit \
#     --prompt "Write a Python quicksort." \
#     --max-tokens 1024

# Python API — load once, then generate.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/gemma-4-e4b-it-4bit")
response = generate(model, tokenizer, prompt="Hello!", max_tokens=256)
print(response)
# NVIDIA NIM — containerised, production-ready
# NOTE(review): tag fixed from "gemma-4-27b-it" — a size that appears
# nowhere else on this page; the family here is e2b / e4b / 26b / 31b.
docker run --gpus all \
  -p 8000:8000 \
  nvcr.io/nim/google/gemma-4-26b-it:latest

# Unsloth Fine-tune Studio (Mac / Linux / WSL)
curl -fsSL https://unsloth.ai/install.sh | sh
unsloth studio -H 0.0.0.0 -p 8888

# Dockerfile snippet for a custom build
FROM nvidia/cuda:12.4.0-runtime-ubuntu22.04
RUN pip install vllm transformers
COPY serve.py /app/serve.py
CMD ["python", "/app/serve.py"]
| Model | MMLU | HumanEval | MATH | GPQA | Arena.ai Rank |
|---|---|---|---|---|---|
| Gemma 4 31B ★ Open | — | — | — | — | #3 Global |
| Gemma 4 26B A4B MoE | — | — | — | — | Top 5 |
| GPT-4o | — | — | — | — | — |
| Llama 3.3 70B | — | — | — | — | — |
| Gemma 4 E4B Edge | — | — | — | — | Best Edge |