Run Google Gemma 4 Locally: Complete Setup Guide for Windows, macOS, and Linux

Google's Gemma 4 is one of the most capable open-weight language models available, and you can run it entirely on your own computer without a cloud API or subscription. Once the weights are downloaded, no internet connection is needed: your data stays local, there is no network round-trip, and there are no usage fees. This guide covers three ways to get Gemma 4 running on Windows, macOS, and Linux: Ollama, llama.cpp, and Hugging Face Transformers.

Gemma 4 Model Variants

Gemma 4 comes in several sizes. Choose based on your hardware:

Gemma 4 Models: Pick Your Size

Model	Parameters	RAM (Q4)	VRAM (GPU)	Best For
Gemma 4 1B	1 billion	~2 GB	~1 GB	Phones, Raspberry Pi, embedded
Gemma 4 4B	4 billion	~4 GB	~3 GB	Laptops, coding assistant, chatbots
Gemma 4 12B ⭐	12 billion	~8 GB	~8 GB	Best balance - most users should start here
Gemma 4 27B	27 billion	~18 GB	~16 GB	High-quality reasoning, complex tasks
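
As a rough illustration of the table above, the choice can be sketched in Python. The RAM figures are copied from the table and the tags match the Ollama commands used later in this guide; the function name `pick_model` and the 2 GB headroom for the operating system are assumptions made for this sketch, not part of any Gemma or Ollama tooling.

```python
# Approximate Q4 RAM needs in GB, copied from the table above.
Q4_RAM_GB = {
    "gemma4:1b": 2,
    "gemma4:4b": 4,
    "gemma4:12b": 8,
    "gemma4:27b": 18,
}


def pick_model(total_ram_gb, headroom_gb=2.0):
    """Return the largest model tag whose Q4 RAM estimate fits, or None.

    headroom_gb is an assumed reserve for the OS and other apps.
    """
    usable = total_ram_gb - headroom_gb
    fitting = [(ram, tag) for tag, ram in Q4_RAM_GB.items() if ram <= usable]
    if not fitting:
        return None
    return max(fitting)[1]


if __name__ == "__main__":
    for ram in (4, 8, 16, 32):
        print(f"{ram} GB RAM -> {pick_model(ram)}")
```

Treat the result as a starting point: a long context window or other running applications can push real memory use above the table's estimates.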

Hardware Requirements

Minimum Hardware for Each Model

💻 CPU-Only (No GPU)

🧠 RAM: 8 GB minimum (16 GB recommended)
⚙ CPU: Any modern x86_64 or Apple Silicon
🐢 Speed: 5-15 tokens/sec (4B model)
🎯 Best for: 1B and 4B models

⚡ With GPU (Recommended)

🎮 NVIDIA: RTX 3060 or newer (8 GB+ VRAM), up to an RTX 4090
🍎 Apple Silicon: M1/M2/M3/M4 (unified memory)
🚀 Speed: 30-80+ tokens/sec (12B model)
🎯 Best for: 12B and 27B models

Method 1: Ollama (Easiest - Recommended)

Ollama is the simplest way to run LLMs locally. One command to install, one command to run. Works on Windows, macOS, and Linux with automatic GPU detection.

Ollama workflow: 📥 Install (one command) → 📦 Pull model (downloads weights) → 💬 Chat

macOS

# Install Ollama
brew install ollama

# Or download from https://ollama.com/download/mac

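# Start the Ollama server. This runs in the foreground, so leave the terminal
# open and use a second one, or run it as a service: brew services start ollama
ollama serve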

# Pull and run Gemma 4 12B (recommended)
ollama run gemma4:12b

# You're now chatting with Gemma 4 locally!
# >>> What is the difference between TCP and UDP?
# TCP is a connection-oriented protocol that guarantees...

# Other model sizes:
ollama run gemma4:1b     # Smallest, fastest
ollama run gemma4:4b     # Good for laptops
ollama run gemma4:27b    # Best quality (needs 18+ GB RAM)

# Apple Silicon (M1/M2/M3/M4) uses the Metal GPU backend automatically
# Confirm with: ollama ps (the PROCESSOR column should report GPU)

Linux (Ubuntu / Debian / Fedora)

# Install Ollama (one-liner)
curl -fsSL https://ollama.com/install.sh | sh

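# Start the server (skip this if the install script already set up and started
# an Ollama systemd service; a second instance cannot bind port 11434)
ollama serve &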

# Pull and run Gemma 4
ollama run gemma4:12b

# For NVIDIA GPU acceleration:
# 1. Install NVIDIA drivers (if not already)
sudo apt install nvidia-driver-550  # Ubuntu
# 2. Ollama auto-detects CUDA GPUs - no extra config needed!

# Verify GPU is being used:
ollama ps
# NAME         SIZE    PROCESSOR
# gemma4:12b   8.1 GB  100% GPU    ← Running on GPU!

Windows

# Option 1: Download installer
# Go to https://ollama.com/download/windows
# Run OllamaSetup.exe - Ollama then runs in the background (system tray)

# Option 2: winget
winget install Ollama.Ollama

# Open PowerShell or Command Prompt:
ollama run gemma4:12b

# NVIDIA GPU: Install latest NVIDIA Game Ready drivers
# Ollama auto-detects CUDA - no manual config needed

# WSL2 (alternative): Install Ollama inside WSL2 Ubuntu
wsl
curl -fsSL https://ollama.com/install.sh | sh
ollama run gemma4:12b
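
Once Ollama is installed on any of the platforms above, one way to confirm the server is up and see which models have been pulled is to query its local API. The sketch below assumes the default address http://localhost:11434 and that the `/api/tags` endpoint returns a `models` list whose entries carry a `name` field; the helper names are hypothetical.

```python
import json
import urllib.error
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed default address


def model_names(tags_payload):
    """Extract model names from a decoded /api/tags response."""
    return [m["name"] for m in tags_payload.get("models", [])]


def list_local_models(base_url=OLLAMA_URL, timeout=5):
    """Return pulled model names, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return model_names(json.load(resp))
    except (urllib.error.URLError, OSError):
        return None


if __name__ == "__main__":
    names = list_local_models()
    if names is None:
        print("Ollama server not reachable; is it running?")
    else:
        print("Pulled models:", names or "none yet")
```

The same information is available from the command line with ollama list; the script form is mainly useful as a startup check inside an application.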

Using Ollama as an API

# Ollama exposes a local REST API on port 11434:
# a native API under /api and an OpenAI-compatible API under /v1

# Chat completion (native API)
curl http://localhost:11434/api/chat -d '{
  "model": "gemma4:12b",
  "messages": [
    {"role": "user", "content": "Explain Kubernetes in 3 sentences"}
  ],
  "stream": false
}'

# Use from Python through the OpenAI-compatible endpoint (with the OpenAI SDK)
# pip install openai
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Required but ignored by Ollama
)

response = client.chat.completions.create(
    model="gemma4:12b",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function to reverse a linked list"},
    ],
)
print(response.choices[0].message.content)

# Works with most libraries that accept a custom OpenAI base URL:
# - LangChain, LlamaIndex, AutoGen, CrewAI
# Just change base_url to http://localhost:11434/v1
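
With "stream": true (the native API's default when the field is omitted), /api/chat returns one JSON object per line instead of a single response. A minimal standard-library sketch of how those lines could be assembled into the final reply; `collect_stream` and `stream_chat` are hypothetical helper names, and the field names (`message.content`, `done`) are assumed from Ollama's chat response format.

```python
import json
import urllib.request


def collect_stream(lines):
    """Join the message fragments from newline-delimited JSON chunks."""
    parts = []
    for raw in lines:
        if not raw.strip():
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break  # the final chunk is flagged with done=true
    return "".join(parts)


def stream_chat(prompt, model="gemma4:12b", url="http://localhost:11434/api/chat"):
    """Send one user message and return the assembled reply (needs a running server)."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }).encode("utf-8")
    req = urllib.request.Request(url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        # Iterating the HTTP response yields one line (one JSON chunk) at a time.
        return collect_stream(resp)
```

In a real chat UI the fragments would be printed as they arrive rather than joined at the end; joining keeps the sketch short.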

Method 2: llama.cpp (Maximum Performance)

llama.cpp is a pure C/C++ inference engine - no Python, no frameworks, maximum speed. It supports GGUF quantized models and runs on CPU, CUDA, Metal, Vulkan, and ROCm.

# Build llama.cpp from source (clone once, then run the build for your platform)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# macOS (Metal GPU support)
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j

# Linux (NVIDIA CUDA)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Linux (AMD ROCm)
cmake -B build -DGGML_HIP=ON
cmake --build build --config Release -j

# Windows (Visual Studio + CUDA)
cmake -B build -DGGML_CUDA=ON -G "Visual Studio 17 2022"
cmake --build build --config Release

# Download Gemma 4 12B in GGUF format (quantized)
# From Hugging Face: search "gemma-4-12b-GGUF"
# Common quantizations:
#   Q4_K_M  - 4-bit, best speed/quality balance (~7 GB)
#   Q5_K_M  - 5-bit, better quality (~8.5 GB)
#   Q8_0    - 8-bit, near-original quality (~12 GB)
#   F16     - 16-bit, unquantized (~24 GB, needs lots of RAM)

# Run interactive chat
#   -ngl 99   offload all layers to the GPU
#   -c 8192   context window (8K tokens)
# (comments cannot follow a trailing backslash, so they are listed up here)
./build/bin/llama-cli \
  -m gemma-4-12b-Q4_K_M.gguf \
  -ngl 99 \
  -c 8192 \
  --interactive-first \
  -p "You are a helpful assistant."

# Run as server (OpenAI-compatible API)
# Note: --host 0.0.0.0 listens on all network interfaces; use 127.0.0.1 to keep it local-only
./build/bin/llama-server \
  -m gemma-4-12b-Q4_K_M.gguf \
  -ngl 99 \
  -c 8192 \
  --port 8080 \
  --host 0.0.0.0

# Now accessible at http://localhost:8080/v1/chat/completions
# OpenAI-compatible: point an OpenAI SDK's base_url at http://localhost:8080/v1
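
Because llama-server accepts the OpenAI chat format, a client needs little more than an HTTP POST. A standard-library sketch, assuming the server started above is listening on port 8080; the helper names are hypothetical, and the `model` value is only a placeholder label, since the server answers with whichever GGUF file it was started with.

```python
import json
import urllib.request

LLAMA_SERVER = "http://localhost:8080/v1/chat/completions"


def build_payload(user_text, system_text="You are a helpful assistant.", max_tokens=256):
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": "gemma-4-12b",  # placeholder label for this sketch
        "messages": [
            {"role": "system", "content": system_text},
            {"role": "user", "content": user_text},
        ],
        "max_tokens": max_tokens,
    }


def extract_reply(response):
    """Pull the assistant text out of an OpenAI-style response dict."""
    return response["choices"][0]["message"]["content"]


def ask(user_text, url=LLAMA_SERVER):
    """POST one question to the server and return the reply (needs a running server)."""
    data = json.dumps(build_payload(user_text)).encode("utf-8")
    req = urllib.request.Request(url, data=data, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return extract_reply(json.load(resp))
```

Calling ask("Explain DNS in one sentence") returns the reply text once the server is running. Splitting payload construction from the network call keeps the request shape easy to inspect and test offline.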

Method 3: Hugging Face Transformers (Python)

Best for developers who want programmatic control, fine-tuning, or integration with ML pipelines.

# pip install transformers torch accelerate

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model (the first run downloads the full 16-bit weights, roughly 24 GB for 12B)
model_name = "google/gemma-4-12b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,    # Half precision (saves VRAM)
    device_map="auto",              # Auto GPU/CPU split
)

# Generate text
prompt = "Explain how DNS works in simple terms:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=500,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

# For lower VRAM: use 4-bit quantization
# pip install bitsandbytes
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
# 4-bit weights for 12B take roughly 7-8 GB, so this targets GPUs with about 8 GB of VRAM

Understanding Quantization

Quantization stores model weights as 8-bit or 4-bit integers instead of 16-bit floats. For Gemma 4 12B this shrinks the weights from about 24 GB (F16) to about 7 GB (Q4_K_M), usually with only a small loss in output quality.

Quantization: Memory vs Quality Trade-off (Gemma 4 12B)

Levels compared, from largest to smallest: F16 (full), Q8_0, Q5_K_M, Q4_K_M ⭐ (recommended), Q3_K_M.
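
The arithmetic behind these sizes is simple: weight memory is roughly the parameter count times the bits per weight, divided by 8. The sketch below uses approximate effective bits-per-weight values for each GGUF format; those values are assumptions for illustration (K-quants mix block types and store scales, so real files vary), and the estimate ignores the KV cache and runtime overhead.

```python
# Approximate effective bits per weight (assumed values for illustration).
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.8,
    "Q3_K_M": 3.9,
}


def weight_size_gb(n_params, quant):
    """Estimate weight storage in GB (1 GB = 1e9 bytes) for a quantization."""
    bits = BITS_PER_WEIGHT[quant]
    return n_params * bits / 8 / 1e9


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        print(f"{quant:7s} ~{weight_size_gb(12e9, quant):.1f} GB")
```

For a 12-billion-parameter model these estimates land close to the file sizes quoted in the llama.cpp section, which is a useful sanity check before downloading a multi-gigabyte file.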

Practical Use Cases

# 1. Local coding assistant (with VS Code)
# Install "Continue" extension in VS Code
# Settings: set provider to "ollama", model to "gemma4:12b"
# Now you have a Copilot-style assistant that runs locally and is free

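# 2. Private document Q&A (RAG)
# pip install langchain langchain-community chromadb
from langchain_core.documents import Document
from langchain_community.llms import Ollama
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA

llm = Ollama(model="gemma4:12b")
# A dedicated embedding model (for example nomic-embed-text) usually retrieves
# better than a chat model; gemma4:12b is kept here to match the rest of the guide
embeddings = OllamaEmbeddings(model="gemma4:12b")

# Load your documents into a vector store
# (placeholder text; replace with your own loaded and chunked files)
documents = [Document(page_content="Replace this with text from your own files.")]
vectorstore = Chroma.from_documents(documents, embeddings)

# Ask questions about YOUR data - no cloud, no data leaks
qa = RetrievalQA.from_chain_type(llm=llm, retriever=vectorstore.as_retriever())
answer = qa.invoke("What were last quarter's revenue numbers?")
print(answer["result"])  # invoke() returns a dict; the answer text is under "result"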

# 3. CLI chatbot
# ollama run gemma4:12b
# Just start typing - it remembers conversation context

# 4. API backend for your app
# Run: ollama serve
# Your app calls http://localhost:11434/v1/chat/completions
# No network round-trip, no per-token cost, and your data never leaves the machine
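
Under the hood, the retriever in a RAG pipeline ranks stored chunks by how close their embedding is to the question's embedding. A dependency-free sketch of that ranking step using cosine similarity, one common choice (vector stores may default to a different metric). The three-dimensional vectors are toy values invented for illustration, since real embeddings come from the embedding model and have hundreds or thousands of dimensions, and `top_k` is a hypothetical helper, not part of LangChain or Chroma.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunks, k=2):
    """Return the k chunk texts most similar to the query vector."""
    ranked = sorted(chunks, key=lambda c: cosine_similarity(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy (text, embedding) pairs.
CHUNKS = [
    ("Q3 revenue summary", [0.9, 0.1, 0.0]),
    ("Office parking policy", [0.0, 0.2, 0.9]),
    ("Q3 expense report", [0.7, 0.6, 0.1]),
]

if __name__ == "__main__":
    print(top_k([1.0, 0.0, 0.0], CHUNKS))
```

The chunks returned by this step are what gets pasted into the model's prompt as context, which is why retrieval quality matters as much as the choice of language model.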

Performance Tuning

# Ollama tuning options:

# Use more GPU layers (faster, more VRAM), limit the context window (saves memory),
# or set the thread count (CPU inference) from inside an interactive session:
ollama run gemma4:12b
# >>> /set parameter num_gpu 99
# >>> /set parameter num_ctx 4096
# >>> /set parameter num_thread 8

# Keep model loaded in memory (faster subsequent requests)
ollama run gemma4:12b --keepalive 30m
# Or set it server-wide; OLLAMA_* variables are read by the server, not by "ollama run":
# OLLAMA_KEEP_ALIVE=30m ollama serve

# Check what's running and resource usage
ollama ps
# NAME         SIZE    PROCESSOR    UNTIL
# gemma4:12b   8.1 GB  100% GPU     30 minutes

# Benchmark your setup
ollama run gemma4:12b --verbose
# Look for: "eval rate: XX tokens/s"
# Good targets:
#   CPU only:   5-15 tok/s
#   RTX 3060:   25-40 tok/s
#   RTX 4090:   60-100 tok/s
#   M3 Max:     40-60 tok/s
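
The same eval rate can be computed from an API response, which is handy for logging throughput inside an application. A sketch assuming the final /api/chat or /api/generate response carries `eval_count` (generated tokens) and `eval_duration` (nanoseconds); the helper name and the sample numbers are made up for illustration.

```python
def tokens_per_second(response):
    """Compute generation speed from an Ollama response dict.

    Returns None if the timing fields are missing or zero.
    """
    count = response.get("eval_count")
    duration_ns = response.get("eval_duration")
    if not count or not duration_ns:
        return None
    return count / (duration_ns / 1e9)  # nanoseconds -> seconds


if __name__ == "__main__":
    sample = {"eval_count": 120, "eval_duration": 3_000_000_000}  # invented numbers
    print(f"{tokens_per_second(sample):.1f} tok/s")
```

This measures generation only; prompt processing is timed separately, so long prompts add latency that the figure does not show.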

Method Comparison

Which Method Should You Use?

Method	Ollama	llama.cpp	Transformers
Ease of setup	Easiest (1 command)	Medium (compile)	Medium (pip)
Performance	Great	Best (native C++)	Good
API compatibility	OpenAI-compatible	OpenAI-compatible	Python API (no built-in server)
Fine-tuning	No	No	Yes (LoRA, QLoRA)
GPU support	CUDA, ROCm, Metal	CUDA, Metal, ROCm, Vulkan	CUDA, MPS
Best for	Most users	Power users	ML engineers

Troubleshooting

"Out of memory" - Use a smaller model (4B instead of 12B) or a more aggressive quantization (Q3 instead of Q4). Close other apps to free RAM.
"Slow generation (2 tok/s)" - The model is probably running on the CPU. Check the PROCESSOR column of ollama ps, then install NVIDIA drivers (Linux/Windows) or use an Apple Silicon Mac for GPU acceleration.
"Model not found" - Check the exact model name with ollama list. Pull the model first: ollama pull gemma4:12b.
"CUDA out of memory" - The model does not fit in your GPU's VRAM. Use Q4_K_M quantization, or in llama.cpp split the model between GPU and CPU with -ngl 20 (only 20 layers on the GPU).
"Metal not available" (macOS) - Update to macOS 13.3+ and install the Xcode command line tools: xcode-select --install.
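
To pick a starting value for -ngl when a model does not fit in VRAM, one rough approach is to divide the model file size evenly across its layers and see how many fit. The sketch below assumes equal-sized layers and a fixed reserve for the KV cache and buffers; both are simplifications, and the layer count in the example is a made-up placeholder (use the value llama.cpp reports when it loads your model).

```python
import math


def layers_that_fit(model_gb, n_layers, vram_gb, reserve_gb=1.5):
    """Estimate how many layers can be offloaded to the GPU.

    Assumes every layer is the same size and keeps reserve_gb free
    for the KV cache and scratch buffers (an assumed figure).
    """
    if n_layers <= 0 or model_gb <= 0:
        raise ValueError("model_gb and n_layers must be positive")
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


if __name__ == "__main__":
    # 7 GB Q4_K_M file, hypothetical 48 layers, 6 GB GPU
    print("-ngl", layers_that_fit(7.0, 48, 6.0))
```

Start from the estimate and lower it by a few layers if the out-of-memory error persists; larger context windows need a larger reserve.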

Running LLMs locally has never been easier. With Ollama, you're one command away from having a private, free, and fast AI assistant. Start with ollama run gemma4:12b - it's the best balance of quality and speed for most hardware. For maximum performance, try llama.cpp. For ML research and fine-tuning, use Hugging Face Transformers. The future of AI is local.
