How to Run Gemma 4 Locally
Run Google Gemma 4 locally with Unsloth Studio or llama.cpp.

Run Google Gemma 4 locally with Unsloth Studio or llama.cpp.
This guide is for developers who want to run Google’s Gemma 4 models on a laptop, desktop, or workstation without relying on a hosted API. After you follow the steps, you will have a local setup for downloading, launching, and chatting with Gemma 4, plus the settings you need for thinking mode, multimodal input, and memory planning.
You can use either Unsloth documentation and Unsloth on GitHub for a browser-based workflow, or llama.cpp on GitHub for direct local inference. Gemma 4 is Apache-2.0 licensed, supports text, image, and audio on selected variants, and can run with quantized weights to fit smaller machines.
Before you start
Get the latest AI news in your inbox
Weekly picks of model releases, tools, and deep dives — no spam, unsubscribe anytime.
No spam. Unsubscribe at any time.
- Google or Hugging Face account for model downloads.
- Local machine with macOS, Windows, Linux, or WSL.
- Node not required.
- Python 3.10+ for Unsloth Studio workflows.
- CMake 3.22+ and a C++ compiler for building llama.cpp.
- Git 2.30+ installed.
- NVIDIA GPU optional, but helpful for faster inference.
- At least 8 GB RAM for Gemma-4-12B in 4-bit, or 5 GB RAM for E2B in 4-bit.
- Hugging Face CLI or pip access for model downloads.
Step 1: Choose a Gemma 4 variant
Goal: pick the model that matches your hardware before you download anything. Gemma 4 comes in E2B, E4B, 12B Unified, 26B-A4B, and 31B, with different memory needs and tradeoffs between speed and quality.

Use the smallest model that still matches your task. E2B and E4B are best for laptops and edge devices. 12B Unified is a balanced local multimodal option. 26B-A4B is the speed and quality middle ground. 31B is the strongest model if you can afford the memory.
Verification: you should be able to state the target memory budget, such as 8 GB for 12B at 4-bit or 20 GB for 31B at 4-bit.
Step 2: Install Unsloth Studio
Goal: get a browser UI that can search, download, and run Gemma 4 locally. Unsloth Studio supports GGUF and MLX files and can auto-set inference parameters for you.

Install it following the Unsloth Studio guide in the docs, then launch the local server and open the UI in your browser. The workflow is: install, start the app, and sign in with the local password you create on first launch.
python -m pip install unsloth-studioVerification: you should see the Studio UI at http://127.0.0.1:8888 and be able to reach the Chat tab.
Step 3: Download the Gemma 4 model
Goal: fetch the quantized model that fits your device. In Unsloth Studio, search for Gemma 4 in the model browser and download the quant you want. In direct workflows, use Hugging Face and choose a GGUF or MLX build.
If you are starting with local inference, use 8-bit for E2B or E4B, and Dynamic 4-bit for 12B, 26B-A4B, or 31B. If downloads stall, the source recommends checking Hugging Face Hub and XET debugging guidance.
Verification: you should see the model file or shard list fully downloaded, with enough free memory left for runtime overhead.
Step 4: Run Gemma 4 with the right chat settings
Goal: start inference with Gemma 4’s expected prompt format and reasoning controls. Gemma 4 uses standard system, user, and assistant roles, and it can enable or disable thinking with a chat template flag.
For llama.cpp, the source recommends llama-server when you want to disable reasoning reliably. Use the chat-template kwargs flag to turn thinking off, and keep only the final visible answer in multi-turn history.
llama-server -m model.gguf --chat-template-kwargs '{// Related Articles