How to Run Gemma 4 Locally

OraCore Editors

Back to home

[IND] June 7, 20264 min readOraCore Editors

How to Run Gemma 4 Locally

Run Google Gemma 4 locally with Unsloth Studio or llama.cpp.

Share LinkedIn

Run Google Gemma 4 locally with Unsloth Studio or llama.cpp.

This guide is for developers who want to run Google’s Gemma 4 models on a laptop, desktop, or workstation without relying on a hosted API. After you follow the steps, you will have a local setup for downloading, launching, and chatting with Gemma 4, plus the settings you need for thinking mode, multimodal input, and memory planning.

You can use either Unsloth documentation and Unsloth on GitHub for a browser-based workflow, or llama.cpp on GitHub for direct local inference. Gemma 4 is Apache-2.0 licensed, supports text, image, and audio on selected variants, and can run with quantized weights to fit smaller machines.

Before you start

Get the latest AI news in your inbox

Weekly picks of model releases, tools, and deep dives — no spam, unsubscribe anytime.

No spam. Unsubscribe at any time.

Google or Hugging Face account for model downloads.
Local machine with macOS, Windows, Linux, or WSL.
Node not required.
Python 3.10+ for Unsloth Studio workflows.
CMake 3.22+ and a C++ compiler for building llama.cpp.
Git 2.30+ installed.
NVIDIA GPU optional, but helpful for faster inference.
At least 8 GB RAM for Gemma-4-12B in 4-bit, or 5 GB RAM for E2B in 4-bit.
Hugging Face CLI or pip access for model downloads.

Step 1: Choose a Gemma 4 variant

Goal: pick the model that matches your hardware before you download anything. Gemma 4 comes in E2B, E4B, 12B Unified, 26B-A4B, and 31B, with different memory needs and tradeoffs between speed and quality.

Use the smallest model that still matches your task. E2B and E4B are best for laptops and edge devices. 12B Unified is a balanced local multimodal option. 26B-A4B is the speed and quality middle ground. 31B is the strongest model if you can afford the memory.

Verification: you should be able to state the target memory budget, such as 8 GB for 12B at 4-bit or 20 GB for 31B at 4-bit.

Step 2: Install Unsloth Studio

Goal: get a browser UI that can search, download, and run Gemma 4 locally. Unsloth Studio supports GGUF and MLX files and can auto-set inference parameters for you.

Install it following the Unsloth Studio guide in the docs, then launch the local server and open the UI in your browser. The workflow is: install, start the app, and sign in with the local password you create on first launch.

python -m pip install unsloth-studio

Verification: you should see the Studio UI at http://127.0.0.1:8888 and be able to reach the Chat tab.

Step 3: Download the Gemma 4 model

Goal: fetch the quantized model that fits your device. In Unsloth Studio, search for Gemma 4 in the model browser and download the quant you want. In direct workflows, use Hugging Face and choose a GGUF or MLX build.

If you are starting with local inference, use 8-bit for E2B or E4B, and Dynamic 4-bit for 12B, 26B-A4B, or 31B. If downloads stall, the source recommends checking Hugging Face Hub and XET debugging guidance.

Verification: you should see the model file or shard list fully downloaded, with enough free memory left for runtime overhead.

Step 4: Run Gemma 4 with the right chat settings

Goal: start inference with Gemma 4’s expected prompt format and reasoning controls. Gemma 4 uses standard system, user, and assistant roles, and it can enable or disable thinking with a chat template flag.

For llama.cpp, the source recommends llama-server when you want to disable reasoning reliably. Use the chat-template kwargs flag to turn thinking off, and keep only the final visible answer in multi-turn history.

llama-server -m model.gguf --chat-template-kwargs '{

// Related Articles

How to Run Gemma 4 Locally

Before you start

Get the latest AI news in your inbox

Step 1: Choose a Gemma 4 variant

Step 2: Install Unsloth Studio

Step 3: Download the Gemma 4 model

Step 4: Run Gemma 4 with the right chat settings

4 hail risks for Colorado on Monday

Denver Hail Storm Slams Metro and DIA

5 storm timing cues for Denver this week

Denver hailstorm turns roads into a damage checklist

A.J. Brown Trade Talks Tilt Toward Eagles

5 steps to connect Codex with DeepSeek