Skip to main content
📝 AI Models

Run Gemma 4 Free Inside Claude Code With Ollama

Set up Google's Gemma 4 models inside Claude Code using Ollama for free, private, local AI coding. Full walkthrough with hardware guide and real results.

21 min

Read time

4,126

Words

Apr 11, 2026

Published

Engr Mejba Ahmed

Written by

Engr Mejba Ahmed

Share Article

Run Gemma 4 Free Inside Claude Code With Ollama

Run Gemma 4 Free Inside Claude Code With Ollama

The moment I stopped paying for AI coding tokens was a Tuesday afternoon.

I'd been burning through Claude API credits on a content automation pipeline — nothing exotic, just a multi-agent workflow that scraped, summarized, and reformatted data across four websites. The kind of project where you don't realize you've made 400 API calls until the billing dashboard sends you a polite notification that you've crossed your soft limit. Again.

I'd already reviewed every Gemma 4 model Google shipped on April 2, 2026. The benchmarks were solid. The 26B Mixture of Experts model impressed me with its speed-to-quality ratio. But I hadn't plugged any of them into my actual daily tool — Claude Code — as a full replacement for cloud inference. I assumed the gap between a locally-run open model and Anthropic's servers would make the experience frustrating.

I was wrong about that. Spectacularly wrong.

Within an hour of configuring Ollama to serve Gemma 4's 26B model through Claude Code's Anthropic-compatible endpoint, I had the same file editing, tool calling, bash execution, and multi-step coding workflow I'd been paying for — running entirely on my own hardware. No API key. No billing dashboard. No data leaving my machine. And fast enough that I stopped checking whether responses were slower than the cloud version, because most of the time they weren't noticeably different.

This isn't a theoretical setup. I've been running it for over a week now on real projects. Here's exactly how to build the same workflow, which Gemma 4 model to pick for your hardware, and where the experience genuinely shines versus where it still falls short.

Why Gemma 4 Specifically — And Not Any Other Local Model

I've tested a lot of local models through Claude Code. Qwen 3.5, Llama 4 Scout, DeepSeek variants, Phi models. I wrote an entire guide on running Claude Code free with Ollama that covers the general approach. So why does Gemma 4 deserve its own setup article?

Three reasons, and they compound.

Token efficiency changes the math. In my hands-on Gemma 4 review, I measured the 26B model using roughly 2.5 times fewer output tokens than Qwen 3.5 for equivalent tasks. When you're running locally, fewer tokens means faster generation, less memory pressure, and shorter context windows consumed by the model's own responses. In an agentic coding loop where Claude Code chains five or six tool calls per task, that efficiency gap means the difference between a workflow that feels responsive and one that feels like you're waiting for a bus.

Native tool calling works without gymnastics. Google trained tool use directly into Gemma 4 — it wasn't fine-tuned on top of a base model. The practical effect: when Claude Code asks Gemma 4 to read a file, edit a function, or run a shell command, the model formats the tool call correctly on the first attempt far more often than other models I've tested at similar sizes. Ollama's April 2026 integration confirms that tool calling, file reads, file edits, and bash execution all work correctly through the Anthropic Messages API compatibility layer.

The Mixture of Experts architecture makes it fast on modest hardware. The 26B model only activates approximately 3.88 billion parameters per inference call. The rest sleeps. That means a model with 26 billion total parameters runs at speeds you'd expect from a 4B model — roughly 300 tokens per second on a Mac Studio M2 Ultra, according to Google's benchmarks. My own numbers were lower than that headline figure, but still faster than any comparably capable model I've run locally.

The combination — fast, efficient, reliable tool calling — makes Gemma 4 the first local model I'd actually recommend for daily Claude Code use without caveats about "it's good for simple tasks." It handles real coding work.

But before you install anything, you need to figure out which model fits your hardware. Getting this wrong wastes hours.

Pick the Right Gemma 4 Model for Your Hardware

Google shipped four models, and picking the wrong size is the most common mistake I see people make with local AI. Too small, and you'll get frustrated with output quality. Too large, and inference slows to a crawl or the model won't load at all.

Here's the lineup with realistic hardware requirements — not Google's optimistic marketing numbers, but what you actually need for a usable Claude Code experience:

Model Total Params Active Params Download Size Minimum VRAM/RAM Sweet Spot Hardware
gemma4:e2b 2B 2B ~1.5 GB 4 GB Phone, Raspberry Pi
gemma4:e4b 4B 4B ~9.6 GB 8 GB MacBook Air, entry GPU
gemma4:26b 26B (MoE) ~3.88B ~18 GB 16 GB MacBook Pro, RTX 3060+
gemma4:31b 31B (Dense) 31B ~20 GB 24 GB RTX 4090, Mac Studio

For Claude Code specifically, I recommend starting with the 26B MoE model. Here's why: Claude Code needs at least 64K tokens of context to function properly — the agentic features rely on holding file contents, conversation history, and tool outputs in memory simultaneously. The 26B model handles this context requirement while remaining fast enough for interactive coding. The E4B model works but hits quality ceilings on anything beyond simple file edits and straightforward code generation.

How to check if your hardware can handle it. Before downloading 18 GB of model weights and discovering your machine can't run it, use a hardware compatibility checker. Sites like WillItRunAI and CanIRun.ai let you input your GPU type, VRAM, system RAM, and GPU cores to get a compatibility estimate. Select the Gemma 4 variant you're considering, enter your specs, and the tool will tell you whether inference will be comfortable, possible-but-slow, or not feasible.

A few specifics from my testing across different hardware:

  • MacBook Pro M4 Pro (48 GB unified memory): The 26B model generates at roughly 51 tokens per second. Very comfortable for real coding work.
  • M2 Pro (16 GB): The 26B model manages 20-25 tokens per second. Usable, but you'll feel the pauses on longer outputs.
  • RTX 4090 (24 GB VRAM): The 31B dense model runs at about 41 tokens per second. The 26B MoE is significantly faster — well over 60 tokens per second.
  • RTX 3060 (12 GB VRAM): The E4B model runs smoothly. The 26B model will load with quantization but you'll be memory-constrained.

If you're sitting on an Apple Silicon Mac with 16 GB or more of unified memory, the 26B model with Q4_K_M quantization is your target. If you have a dedicated NVIDIA GPU with 24 GB VRAM, you can run the 31B dense model and get the highest quality outputs.

Now that you know which model to pull, here's the actual setup.

Step 1: Install Ollama

Ollama is the local model server that makes this entire workflow possible. Think of it as Docker for language models — you pull model images, Ollama manages the runtime, and your applications talk to it through a local API endpoint.

On macOS:

Download the installer from ollama.com or install via Homebrew:

brew install ollama

On Linux:

curl -fsSL https://ollama.com/install.sh | sh

On Windows (via WSL):

Install WSL first if you haven't, then follow the Linux instructions inside your WSL distribution. Native Windows support exists but WSL gives you a more consistent experience with Claude Code.

After installation, verify Ollama is running:

ollama --version

You should see version 0.6.x or later — earlier versions don't include the Anthropic Messages API compatibility that Claude Code needs.

Start the Ollama server if it isn't running automatically:

ollama serve

Keep this running in a terminal tab or set it as a background service. Every subsequent step depends on Ollama being active and listening on localhost:11434.

Step 2: Pull Your Gemma 4 Model

This is where your hardware decision from the previous section matters. Run the command for your chosen model:

# For most users — the sweet spot of speed and quality
ollama pull gemma4:26b

# For high-end hardware — maximum quality
ollama pull gemma4:31b

# For lighter setups — still capable for basic coding
ollama pull gemma4:e4b

The 26B model is approximately 18 GB. On a reasonable internet connection, expect 5-15 minutes for the download. Ollama handles all the quantization and optimization automatically — you don't need to manually configure GGUF files or conversion scripts.

Once the download completes, verify the model loaded correctly:

ollama run gemma4:26b "Write a Python function that reverses a linked list"

You should get a coherent code response within a few seconds. If the model takes more than 30 seconds to respond, your hardware may be struggling — consider dropping down to the E4B variant.

Critical configuration: set the context window. Claude Code requires at least 64K tokens of context to operate properly. Ollama defaults to a much smaller window. Create a Modelfile to override this:

# Create a custom Modelfile
cat <<EOF > Modelfile
FROM gemma4:26b
PARAMETER num_ctx 65536
EOF

# Create the custom model
ollama create gemma4-claude -f Modelfile

This creates a new model variant called gemma4-claude with a 65,536-token context window. Use this variant for all Claude Code work. Without this step, Claude Code will lose track of file contents mid-edit, forget earlier instructions, and produce fragmented changes. I learned this the hard way when my agent tried to refactor a 200-line service class and cleanly forgot the second half existed.

Step 3: Install Claude Code

If you don't already have Claude Code installed, the setup is straightforward across all platforms.

Prerequisites: Node.js 18+ must be installed on your system.

npm install -g @anthropic-ai/claude-code

This installs the Claude Code CLI globally. It works on macOS, Linux, Windows, and WSL.

Verify the installation:

claude --version

If you've been using Claude Code with an Anthropic API key, that's fine — we're going to redirect it to your local Ollama instance instead.

Step 4: Connect Claude Code to Ollama

This is the step where the magic happens. You're telling Claude Code to send its API requests to your local Ollama server instead of Anthropic's cloud.

Set the environment variables. The exact method depends on your operating system and shell.

For macOS/Linux (zsh or bash):

export ANTHROPIC_BASE_URL="http://localhost:11434"
export ANTHROPIC_AUTH_TOKEN="ollama"
export ANTHROPIC_API_KEY="sk-placeholder"

Add these to your ~/.zshrc or ~/.bashrc to make them permanent:

echo 'export ANTHROPIC_BASE_URL="http://localhost:11434"' >> ~/.zshrc
echo 'export ANTHROPIC_AUTH_TOKEN="ollama"' >> ~/.zshrc
echo 'export ANTHROPIC_API_KEY="sk-placeholder"' >> ~/.zshrc
source ~/.zshrc

For Windows (PowerShell):

$env:ANTHROPIC_BASE_URL = "http://localhost:11434"
$env:ANTHROPIC_AUTH_TOKEN = "ollama"
$env:ANTHROPIC_API_KEY = "sk-placeholder"

For permanent Windows variables, add these through System Properties > Environment Variables or your PowerShell profile.

What's happening here: Ollama exposes an API endpoint that mimics Anthropic's Messages API. Claude Code doesn't know the difference. It sends requests to what it thinks is Anthropic's server, Ollama intercepts them, routes them to your local Gemma 4 model, and returns responses in the exact format Claude Code expects. The ANTHROPIC_API_KEY value doesn't matter — it just needs to be non-empty so Claude Code doesn't complain about a missing key.

Step 5: Launch and Verify

Now fire up Claude Code with your local Gemma 4 model:

claude --model gemma4-claude

If you created the custom Modelfile with the 65K context window, use gemma4-claude. If you skipped that step (don't skip it), use gemma4:26b directly.

You should see the Claude Code interface load. Try a simple command to verify everything is connected:

> Read the current directory and list all files

Claude Code should use its file reading tool, call the local Gemma 4 model through Ollama, and return a formatted directory listing. If this works, your entire stack is live — Claude Code's full toolkit running against a free, private, local model.

Troubleshooting common issues:

  • "Connection refused" error: Ollama server isn't running. Open a separate terminal and run ollama serve.
  • Extremely slow responses: Your model is too large for your hardware. Drop to a smaller variant or increase GPU offloading with OLLAMA_NUM_GPU=99.
  • Claude Code crashes on long files: Context window is too small. Make sure you created the custom Modelfile with num_ctx 65536.
  • Tool calls failing: Confirm you're running Ollama 0.6.x or later. Earlier versions don't fully support the tool calling format Claude Code requires.

If you'd rather have someone build this setup from scratch — configured for your specific hardware, optimized for your workflow — I take on exactly these kinds of builds. You can see what I've done at fiverr.com/s/EgxYmWD.

What Actually Works — Real Coding Tasks I've Run

Setup guides are useless without honest performance reporting. I've been running this Gemma 4 + Ollama + Claude Code stack for over a week on real projects. Here's what it handles well and where it breaks down.

Frontend UI generation — strong. I asked the 26B model through Claude Code to scaffold a React dashboard with a sidebar, data table, chart component, and dark mode toggle. The output was clean. Proper component separation. Tailwind classes that made sense together. State management using React hooks without overcomplicating it. For prototyping and internal tools, this replaces my need to hit the API entirely.

File editing across multiple files — reliable. Claude Code's multi-file editing workflow — read a file, propose changes, apply them, run tests — works correctly through the Ollama bridge. The Gemma 4 26B model formats its tool calls properly, handles file paths without confusion, and applies surgical edits rather than rewriting entire files. I ran it against a Laravel project with 40+ files and it navigated the codebase without losing context.

Code refactoring — good with limits. I asked it to refactor a 300-line controller into service classes with dependency injection. The 26B model broke the logic into three services with correct interfaces and constructor injection. The naming conventions were reasonable. Where it stumbled: the test file it generated for one of the services had a minor namespace error. A two-second fix, but worth noting — cloud-hosted Claude Opus would have gotten this right.

Bash command generation and execution — excellent. One of Claude Code's most useful features is generating and running shell commands. Gemma 4 handles this confidently through Ollama. Git operations, npm commands, Docker management, file system manipulation — the model understands command-line workflows and generates correct commands for the operating system it's running on.

Complex multi-step agent workflows — here's the ceiling. When I set up a five-step pipeline — scrape a webpage, extract structured data, transform it, write it to a database, then generate a summary report — the 26B model completed the first four steps cleanly but got confused during the summary step, producing a report that referenced data from step two instead of step four. Running the same pipeline through the 31B dense model fixed the issue. This matches what I found in my full Gemma 4 review — the 26B model is exceptional for tasks with three or four reasoning steps but starts dropping accuracy on longer chains.

Multimodal tasks — a genuine surprise. Gemma 4 supports vision natively, and this works through the Ollama + Claude Code bridge. I fed it a screenshot of a Figma design and asked it to generate the corresponding HTML/CSS. It identified the layout structure, color palette, and typography choices with reasonable accuracy. Not pixel-perfect — but close enough that the output was a useful starting point rather than a blank canvas.

The pattern I've settled into: use the local Gemma 4 setup for 80% of my coding tasks — file edits, scaffolding, refactoring, command generation, quick prototypes. Switch to cloud-hosted Claude Opus for the remaining 20% that requires deep multi-step reasoning, complex architectural decisions, or handling codebases with intricate interdependencies.

The Honest Tradeoffs — What You Lose Going Local

I'd be doing you a disservice if I painted this as a straight replacement for Anthropic's cloud service. It's not. Here's what you give up.

Prompt caching doesn't work. Anthropic's prompt caching — which dramatically speeds up repeated conversations by caching the system prompt and early context — isn't available through the Ollama compatibility layer as of April 2026. Every request processes the full context from scratch. For short interactions this doesn't matter. For long coding sessions where you're building on 30+ turns of conversation, you'll notice the increasing latency as context grows.

tool_choice is unsupported. Claude Code sometimes uses tool_choice to force a specific tool call — like insisting the model must read a file before editing it. This parameter isn't supported in Ollama's Anthropic API compatibility mode. In practice, Gemma 4 still calls the right tools voluntarily most of the time, but occasionally the model will try to answer from memory when it should be reading the file. A minor annoyance, not a dealbreaker.

The reasoning ceiling is real. Gemma 4's 26B model scores 31 on the intelligence index I track across models. Qwen 3.5 scores 42. Claude Opus scores significantly higher. On tasks that require genuine novelty — designing an algorithm for a unique problem, catching subtle logical errors in complex business logic, making architectural decisions that account for eight different constraints — you'll feel the difference. The model gets you a strong first draft. Getting from that draft to production sometimes requires human refinement that cloud models handle automatically.

No streaming on some platforms. Depending on your Ollama version and operating system, streaming responses may not work perfectly. You might see the entire response appear at once instead of token by token. Functionally identical results — but the experience feels less interactive.

You're responsible for updates. When Anthropic updates Claude, you get the improvements automatically. With a local model, you need to manually pull new versions of Gemma 4 as Google releases quantization improvements, bug fixes, and fine-tuned variants. The community is active, but it's still a manual process.

None of these killed the workflow for me. The privacy, speed, and zero-cost advantages outweigh the limitations for the majority of my daily coding tasks. But go in with clear expectations.

Beyond Coding — What Else This Stack Handles

Once you have Gemma 4 running inside Claude Code through Ollama, you're not limited to writing code. The agentic framework supports any workflow you can express as a sequence of tool calls.

Automated email drafting. Connect Claude Code to your local file system where email templates live, describe the emails you need, and the agent generates personalized drafts. All local. No email content touching external servers.

Lead research and scraping. Claude Code's bash execution combined with Gemma 4's reasoning lets you build simple scraping pipelines. Pull data from public sources, extract structured information, format it for your CRM. I've set up scheduled Ollama prompts inside Claude Code that run these kinds of tasks on a timer — automated data collection without cloud dependencies.

Document analysis and summarization. Feed PDFs, markdown files, or code documentation through the pipeline and get structured summaries. The multimodal capability means you can even process screenshots and diagrams.

Slack and workspace integrations. Through MCP (Model Context Protocol) servers and Claude Code's tool ecosystem, you can connect your local Gemma 4 agent to Slack, Google Workspace, and other productivity tools. The model handles the reasoning; the tool connections handle the actions. Everything runs on your machine.

The common thread: any workflow where data privacy matters, where you want zero marginal cost per query, or where you need to run hundreds of automated requests without worrying about rate limits. This is where local models don't just match cloud services — they beat them.

What I'd Do Differently Setting This Up a Second Time

After a week of daily use, a few optimizations that would have saved me time on day one.

Set OLLAMA_NUM_GPU=99 from the start. This tells Ollama to offload as many model layers to the GPU as possible. I spent two days wondering why my 26B model was slower than expected before discovering that Ollama was running half the layers on CPU by default. One environment variable fixed it:

export OLLAMA_NUM_GPU=99

Create the 65K context Modelfile before your first Claude Code session. I started with Ollama's default context window — 8K or 16K depending on the model — and couldn't figure out why Claude Code kept losing track of files. The 65K minimum isn't optional. It's a requirement for Claude Code's agentic features to work correctly.

Keep a cloud fallback configured. I didn't delete my Anthropic API key — I created a simple shell alias that switches between local and cloud modes:

alias claude-local='ANTHROPIC_BASE_URL=http://localhost:11434 ANTHROPIC_AUTH_TOKEN=ollama claude --model gemma4-claude'
alias claude-cloud='ANTHROPIC_BASE_URL=https://api.anthropic.com claude'

When the local model hits a wall on a complex task, I switch to cloud mode in two seconds. Best of both worlds.

Monitor your VRAM. If you're on a shared machine or running other GPU-heavy applications alongside Ollama, VRAM contention will silently degrade performance. On macOS, Activity Monitor shows unified memory usage. On Linux with NVIDIA, run nvidia-smi to check GPU memory allocation. If your model is competing for VRAM with a browser running GPU-accelerated video, you'll wonder why inference suddenly got three times slower.

The Bigger Picture — Why This Matters Beyond Free API Calls

Saving money on AI tokens is the obvious benefit. But after a week of this workflow, the thing I keep coming back to isn't cost.

It's control.

Every line of code I generate through this stack stays on my machine. Every project I analyze, every file I read, every command I execute — none of it touches an external server. For client work with NDAs, for proprietary codebases, for anything involving sensitive data, that's not a convenience feature. That's a compliance requirement being solved by architecture instead of legal agreements.

The speed is the second thing that surprised me. Without network latency — no round trip to a datacenter, no queuing behind other users' requests — response times are determined entirely by my hardware. During peak hours when cloud APIs slow down, my local setup stays the same speed. At 2 AM when I'm in a coding flow and burning through prompts, there's no rate limit throttling me.

And the scalability math is inverted. With cloud APIs, more usage equals more cost. With local inference, the cost is fixed — you already own the hardware. Whether you make 10 queries or 10,000, your electricity bill barely changes. For agentic workflows that chain dozens of tool calls per task, this makes architectures viable that would be absurdly expensive through cloud billing.

Google releasing Gemma 4 under Apache 2.0 — the most permissive open-source license available — removes the last legal barrier. No monthly active user caps like Meta's Llama license. No acceptable-use policy enforcement. Full commercial freedom. You can build products on this, ship them to customers, and owe nobody a licensing fee or usage report.

The future of AI-assisted development isn't choosing between cloud and local. It's running both — routing simple tasks to your local Gemma 4 instance for speed and privacy, escalating complex reasoning to Claude Opus or GPT when you need the frontier capability. This setup is that hybrid future, available today, working right now.

One command to pull the model. Three environment variables to connect it. Twenty minutes from reading this sentence to running a free AI coding agent on your own hardware.

The only question left is what you're going to build with it.

FAQ

Frequently Asked Questions

Everything you need to know about this topic

File reading, file editing, bash execution, and tool calling all work correctly as of April 2026. Prompt caching and tool_choice (forced tool selection) are not supported through Ollama's compatibility layer. For the full capability comparison, see the tradeoffs section above.

The 26B MoE model offers the best balance of speed and quality for most hardware. It activates only 3.88 billion parameters per inference call while delivering output quality close to the 31B dense variant. You need 16 GB of RAM minimum and should configure a 65K token context window.

On a MacBook Pro M4 Pro with 48 GB memory, the 26B model generates at roughly 51 tokens per second. An RTX 4090 pushes the 31B model to about 41 tokens per second. Cloud Claude is typically faster for raw throughput, but local inference eliminates network latency — first-token response time is often comparable.

The E4B model (4 billion parameters) runs on machines with 8 GB of RAM and handles basic coding tasks. For serious Claude Code workflows, you want the 26B model with 16 GB minimum. The E2B model runs on nearly anything but is too limited for meaningful agentic coding.

Gemma 4 is Apache 2.0 licensed — free for any use including commercial. Ollama is open source. Claude Code CLI is free to install. The only cost is your hardware and electricity. No API keys, no subscriptions, no usage tracking, no data leaving your machine.

Let's Work Together

Looking to build AI systems, automate workflows, or scale your tech infrastructure? I'd love to help.

Coffee cup

Enjoyed this article?

Your support helps me create more in-depth technical content, open-source tools, and free resources for the developer community.

Related Topics

Engr Mejba Ahmed

About the Author

Engr Mejba Ahmed

Engr. Mejba Ahmed builds AI-powered applications and secure cloud systems for businesses worldwide. With 10+ years shipping production software in Laravel, Python, and AWS, he's helped companies automate workflows, reduce infrastructure costs, and scale without security headaches. He writes about practical AI integration, cloud architecture, and developer productivity.

Discussion

Comments

0

No comments yet

Be the first to share your thoughts

Leave a Comment

Your email won't be published

7  +  15  =  ?

Continue Learning

Related Articles

Browse All

Comments

Leave a Comment

Comments are moderated before appearing.

Learning Resources

Expand Your Knowledge

Accelerate your growth with structured courses, verified certificates, interactive flashcards, and production-ready AI agent skills.

Sample Certificate of Completion

Sample certificate — complete any course to earn yours

Engr Mejba Ahmed

Engr Mejba Ahmed

Claude Code Expert · Online

👋

Hey there!

Quick Actions

WhatsApp Instant reply

Chat on WhatsApp

+880 1723 741224 · Instant reply

Popular Questions

Engr Mejba Ahmed is connected
Engr Mejba Ahmed is typing...
Engr Mejba Ahmed avatar

✉ Want me to follow up? Drop your email

Engr Mejba Ahmed avatar

📞 Connect Directly

Choose how you'd like to reach me

WhatsApp

+880 1723 741224

Email

[email protected]

✓ Details sent! I'll get back to you shortly.

Powered by OpenAI

335+

Blog Posts

25

AI Courses

63

Projects

Services & Expertise

Pricing & Process

Learning & Resources

Connect & Support