Skip to main content
📝 AI Models

Run Gemma 4 Locally With LM Studio (No Terminal)

I set up Gemma 4 in LM Studio on a mid-range PC and ran it through real work — meeting notes, whiteboard photos, coding. Here's the full walkthrough, settings

20 min

Read time

3,810

Words

Apr 19, 2026

Published

Engr Mejba Ahmed

Written by

Engr Mejba Ahmed

Share Article

Run Gemma 4 Locally With LM Studio (No Terminal)

Run Gemma 4 Locally With LM Studio (No Terminal)

My Wi-Fi cut out on a Tuesday afternoon, mid-sentence, while I was trying to turn a 42-minute meeting transcript into a list of action items. Claude Pro: dead. ChatGPT: a spinning tab. My day was officially on hold — except it wasn't, because fifteen seconds later Gemma 4 was chewing through the same transcript on my laptop with the airplane mode icon staring back at me from the menu bar. No cloud. No API key. No "your request could not be completed." Just a structured list of owners, deadlines, and follow-ups, generated by a model that sat on my SSD and asked nothing of the internet.

That was the moment I stopped treating local AI as a hobby project and started treating it as real infrastructure.

The piece that made it possible wasn't just Gemma 4 — Google's open model does the heavy lifting, sure, but the reason I actually had it installed and running in under ten minutes is LM Studio. No command line. No Python environments. No CUDA driver fights at 11pm. A desktop app. You click "download," you click "load," you start chatting. That's the whole setup.

I've been running this stack for a couple of weeks now across a MacBook and a mid-range Windows PC. It's not perfect — there are places where Claude and GPT still earn their keep, and I'll show you exactly where. But for a surprisingly large chunk of my daily workflow, Gemma 4 through LM Studio has quietly taken over.

Here's the full setup, the model size I actually settled on, the LM Studio features nobody talks about, and the three real tests I ran to stress this thing before I trusted it with client work.

Why Local AI Finally Matters in 2026

The AI industry spent three years training people to treat cloud models as the only serious option. Claude Opus, GPT-5.4, Gemini 3 — the frontier lives in somebody else's data center, you pay a subscription, you accept the terms of service, and that's the deal.

That deal has three cracks in it, and all three got wider this year.

The first is cost. I was running roughly $180/month across Claude Pro, ChatGPT Plus, and a Cursor seat, plus API credits for agentic experiments that ate through $20 in an afternoon when a loop went sideways. For a working engineer, that's fine. For a student, a side-hustler, or someone running twenty agents in parallel? It adds up faster than it should.

The second is privacy. Every prompt I send to a cloud model is a document leaving my machine. For most of my work that's acceptable. For client contracts, medical forms I'm helping a family member understand, half-finished code that shouldn't be sitting in a training pipeline — it's genuinely not.

The third is availability. Cloud APIs go down. Rate limits hit at the worst moment. Your internet drops. I wrote a whole post on why I stopped waiting for perfect AI tools and started building with what works offline, and local inference has been the single biggest reliability win of the past quarter.

Gemma 4 matters because it's the first open model where I don't feel like I'm making a compromise to run locally. Google released it on April 2, 2026 under an Apache 2.0 license — genuinely open, commercially usable, no strings. The 26B Mixture of Experts variant ranks sixth on the Arena AI leaderboard among all open models. The 31B dense variant ranks third. These aren't "pretty good for free" numbers. These are "beats models twenty times their size" numbers, according to Google's own benchmark release and independent testing that's followed.

And LM Studio is what turns that from a research paper into something you actually use.

Before we get to the install, there's one thing worth understanding about which Gemma 4 variant to pick — because picking wrong is the single most common mistake I see people make.

The Four Gemma 4 Sizes — And Why I Run the 4B Model Most Days

Gemma 4 ships as four distinct models, each tuned for a different class of hardware. Running the wrong size is the difference between "wow, this is fast" and "why is my laptop fan screaming."

Model Total Params Active Params Context Where It Runs
E2B 2B 2B 128K Phones, Raspberry Pi, low-RAM laptops
E4B 4B 4B 128K Most mid-range laptops and desktops
26B MoE 26B ~3.8B 256K 32GB+ RAM machines, Mac Studio, gaming PCs
31B Dense 31B 31B 256K High-VRAM GPUs, workstations, cloud deploys

The shortest honest answer to "which one should I use" is: start with the 4B. That's the one I default to, that's the one I reach for first when I'm helping someone set this up, and it's the one that Kevin's original tutorial video wisely recommends for most PCs.

Here's why. The 4B model gives you roughly 90% of what the 26B gives you for common tasks — summarization, structured extraction, question answering, moderate coding help — at a fraction of the memory footprint. On my MacBook Pro (M3 Pro, 18GB unified memory) the 4B runs at roughly 45-60 tokens per second. Fast enough that I forget I'm not on the cloud.

The 26B MoE is where things get interesting if you have the RAM. Because only about 3.8 billion parameters activate per token — that's the "Mixture of Experts" trick — it runs dramatically faster than a traditional 26B dense model would. LM Studio reports it streaming at roughly 15-25 tokens per second on a well-equipped gaming PC. Quality jumps noticeably on reasoning-heavy tasks. But it wants at least 32GB of system RAM, and if you don't have it, LM Studio will spill to disk and grind.

The 2B model is what I run on an older Windows laptop I keep around for travel. Honestly? For quick summarization and formatting tasks, it's fine. You'll feel the quality drop on anything that needs reasoning, but for "turn this wall of text into bullets," it gets the job done.

The 31B dense is for people with serious GPUs — a 24GB VRAM card minimum, realistically a 48GB setup if you want the full 256K context at decent speeds. Most readers are not that person. If you are, you already know.

My recommendation: install the 4B, use it for a week, then decide if you need more. Most people don't.

With that out of the way, let's actually install this thing.

Installing LM Studio in Under Five Minutes

LM Studio is a desktop app available at lmstudio.ai. Mac, Windows, and Linux all supported. The download is in the neighborhood of 500MB — not small, but it's a one-time hit.

Step 1 — Download and Install

Go to the LM Studio site, click the download button for your platform. On Mac, you drag the app to Applications. On Windows, you run the installer. On Linux, there's an AppImage that Just Works if you make it executable.

First launch takes roughly ten seconds. The app opens to a dark-themed interface with a search bar front and center and a left sidebar for chats, models, and settings. If you've ever used a modern chat app, nothing here will surprise you.

LM Studio will ask if you want to enable developer mode. For now, say no. You don't need it. Developer mode exposes the local API server and advanced inference settings — powerful but noisy if you're just trying to chat with a model.

Step 2 — Search for Gemma 4 and Pick Your Size

Click the magnifying glass icon (or press Cmd/Ctrl+K) to open the model search. Type "Gemma 4."

You'll see a list of Gemma 4 variants. This is where the naming gets a little intimidating — you'll see things like google/gemma-4-4b-it-GGUF and google/gemma-4-26b-a4b-MLX. Two things to understand:

  • GGUF is the format used by llama.cpp. Works on every platform. This is your default.
  • MLX is Apple's framework. Faster on Apple Silicon Macs specifically. If you're on an M1/M2/M3/M4 Mac, prefer the MLX version when available.

The suffix like -4b-it means "4 billion parameters, instruction-tuned." Always pick the instruction-tuned variant for chat. The base models are for researchers fine-tuning their own systems — they'll feel weirdly non-conversational if you try to use them directly.

For most readers, the right click is: google/gemma-4-4b-it-GGUF on Windows/Linux, or google/gemma-4-4b-it-MLX on Mac.

LM Studio also shows you a quantization selector — Q4_K_M, Q5_K_M, Q8_0, and so on. The number refers to bits of precision. Lower bits = smaller file, faster inference, slightly worse quality. For 99% of users, Q4_K_M is the right default. It's the accepted sweet spot across the local AI community, and I've run side-by-side tests against Q8_0 where I genuinely could not tell the difference on real tasks.

Click download. The 4B model at Q4_K_M is roughly 2.5GB. On a decent connection you're looking at a two-minute wait.

Step 3 — Load the Model

Once downloaded, head to the chat view (the speech bubble icon, top left). At the top of the chat window, there's a model selector. Click it, pick your freshly downloaded Gemma 4, and hit load.

Loading takes anywhere from five seconds on a fast SSD to thirty on a slower laptop. LM Studio shows you memory usage as it loads. On my MacBook Pro, the 4B Q4_K_M eats about 3.2GB of RAM when loaded. Modest.

You'll also see a prompt asking whether to enable GPU offloading. Say yes. LM Studio auto-detects your GPU and sends as many layers as fit. For a 4B model, every layer fits. For larger models, this is where the app earns its keep — it'll tell you "32/41 layers on GPU" and automatically split the rest to CPU if needed.

And now you're chatting with Google's Gemma 4, running entirely on your laptop, with your internet connection technically optional.

This is the part of most tutorials where writers hand you a "Hello, world" prompt and call it a day. I'm going to do something more useful — show you the three real tests I ran before I trusted this setup with actual work.

The Three Tests That Convinced Me Gemma 4 Is Production-Ready

Local AI lives or dies on whether it can handle the work you'd otherwise give a cloud model. Benchmarks are one thing; "does it survive my Tuesday" is another.

Test 1 — Meeting Notes to Action Items

I grabbed a real meeting transcript from a recent client call. 2,800 words, four participants, a messy mix of decisions, tangents, and half-finished ideas. The kind of document where humans reach for AI specifically because reading through it manually is miserable.

I pasted it into LM Studio and used a prompt I use every day with Claude:

Extract action items from this transcript. For each, give me the owner, the deadline (or "none stated" if not mentioned), and the one-sentence context. Return as a markdown table.

Gemma 4 4B produced a clean, structured table with seven action items. Owners correctly attributed. Deadlines pulled accurately when stated. Context tight and useful. The one miss — a hedged comment about "maybe getting Priya involved by end of Q2" — Gemma attributed to Priya as an owner, which was arguably wrong. Claude Opus 4.5 caught that same nuance correctly on the same transcript.

But here's the kicker: I ran this five separate times on different transcripts. Gemma 4 got the structural output right every single time. For 90% of meeting-notes work, which is mostly mechanical extraction rather than nuanced judgment, it's fully sufficient.

The "think mode" toggle in LM Studio — a feature I almost missed on my first pass through the UI — turned out to be the differentiator for this task. When you enable reasoning mode on Gemma 4 (there's a thinking-brain icon in the chat input area), the model runs through a multi-step reasoning pass before producing its final output. It's slower — maybe 2-3x longer response time — but the quality jump on anything involving multi-step inference is genuinely noticeable.

For a simple extraction task, skip think mode. For "figure out what these four people are actually disagreeing about under the surface," enable it. That's the rule I've settled on.

Test 2 — Whiteboard Photo to Structured Notes

This is the test that surprised me most. Gemma 4 is multimodal out of the box — it handles image input natively, not as a bolt-on.

I took a photo of a whiteboard from a brainstorming session. Bad lighting, my terrible handwriting, a mess of arrows and abbreviations. I dragged the image into LM Studio's chat window (yes, you can just drag and drop), asked for "a summary plus a list of takeaways I can share with the team," and watched the model work.

It nailed the structure. It even correctly interpreted a poorly-drawn flowchart as "three-stage user onboarding with a branching decision at step two." One abbreviation got misread — "CR" as "Customer Relations" instead of "Code Review," which was a context-dependent judgment Gemma had no way to know. I edited that manually in about four seconds.

What I want to flag here: you need to pick a Gemma 4 variant that supports vision for this to work. Not every quantization in LM Studio includes the vision encoder. Look for model cards that explicitly say "multimodal" or include the image icon in LM Studio's model list. On the 4B variants, this is standard; on some community re-quantizations, vision was stripped to save space.

Test 3 — Code Review on a Real PR

I fed Gemma 4 a 340-line TypeScript PR from one of my Next.js projects. The prompt: "Review this code. Flag bugs, security issues, and architectural concerns. Be direct."

Gemma 4 caught four real issues. One genuine security concern (a missing input validation on an API route that accepted user-supplied IDs). Two legitimate code-quality improvements. One pedantic style comment I disagreed with.

It missed two things Claude Sonnet 4.7 flagged on the same PR — a subtle race condition in a pair of async calls, and a type narrowing issue that Claude correctly traced through three files.

Here's my honest take: for day-to-day code review, Gemma 4 4B is competent. For complex cross-file reasoning, the cloud frontier models are still measurably better. This isn't surprising — the cloud models are 50-100x larger, and they show it on deep reasoning tasks. But "competent enough for 80% of what I ask" running on my laptop for free is a genuinely new category.

Now, the LM Studio features that made this workflow actually pleasant.

The LM Studio Features I Actually Use Every Day

Most local AI tutorials focus on the install and stop. That's a mistake. LM Studio has a handful of features that, once you find them, turn it from "a chat window to a local model" into "a genuinely good daily AI interface." Here are the ones I lean on.

Branching

This is the killer feature and almost nobody mentions it. In any chat, you can branch from any message — create a new thread that picks up from that point without losing the original. The three-dot menu on any assistant response has a "branch" option.

Why it matters: when I'm exploring a problem with Gemma 4, I frequently want to try three different angles from the same setup. Branching lets me keep the full context and try each approach as a separate thread. Claude and ChatGPT both have similar features, but LM Studio's implementation is cleaner — the left sidebar shows branches as nested threads under their parent.

Folders and Organization

The chat sidebar supports folders. I use four: "Work," "Writing," "Code," "Experiments." Everything gets filed. A month in, I can find any conversation in seconds. If you've ever lost a ChatGPT thread because their UI has no real search, this alone is worth the switch for local work.

Split View

Two chats, side by side. I use this constantly for comparing outputs — feed the same prompt to Gemma 4 4B and Gemma 4 26B, watch the responses stream in parallel, see what the size difference buys you. Also useful for "write this email in two different tones and let me pick."

Custom Instructions Per Chat

Each chat can carry its own system prompt. Mine for code review: "You are a senior engineer. Be direct. Point out bugs first, style second. Always format code suggestions as complete blocks, not inline fragments." Mine for writing: "You reply in bullet points only. No preamble. No sign-offs." Set once per chat type, saved forever.

You can also set a global default system prompt in settings, which becomes your baseline personality across all new chats.

Regenerate, Edit, Delete

Standard controls, but the edit function is more useful than most users realize. If Gemma goes off track three messages deep, don't start a new chat — edit the message where the drift began, regenerate from there. The context stays clean and the model recovers.

If you've made it this far, you already have a better local AI setup than 95% of people running Claude Desktop. The next section is where it gets really powerful.

Real Talk — Where Gemma 4 Falls Short (And When to Reach for Cloud Models)

No article about a new tool is honest without the part where the tool loses.

Long-context reasoning. Gemma 4 technically supports 128K-256K tokens depending on variant. In practice, reasoning quality degrades noticeably past about 32K tokens of input. Cloud models like Claude Sonnet 4.7 with 1M context handle deep document analysis at scales Gemma can't match. If you're doing "read this entire codebase and find the architectural problem," use the cloud.

Deep coding reasoning. I already showed this — the 26B MoE closes some of this gap, but frontier cloud models still win on complex multi-file bug hunts, API design discussions, and anything involving implicit cross-file dependencies.

Current information. Gemma 4 has a knowledge cutoff. No web search. No "what's the current price of X." For anything requiring fresh data, you need cloud models with web search or an agent stack that handles retrieval.

Agentic workflows with tools. Gemma 4 supports function calling and structured outputs natively — this is a real strength — but for complex agent loops with many tools, LM Studio's local API works but isn't as refined as the full Anthropic or OpenAI agent ecosystems yet.

The honest framing: local AI via Gemma 4 handles roughly 70% of what I used to send to the cloud. The remaining 30% is where the frontier still matters. That 70% running for free, offline, and private is still an enormous shift.

I wrote a related piece on Qwen 3.6's agentic coding strengths that explains which open model I reach for when I want specifically agentic capability rather than general chat. The short version: Gemma 4 for chat and multimodal, Qwen for agent pipelines.

What I'd Do Differently If I Were Setting This Up From Scratch Today

Three things I wish I'd known on day one.

First, check your RAM before you pick a model. On Mac, hit "About This Mac." On Windows, open Task Manager → Performance → Memory. If you have 8GB, use the 2B model. 16GB: 4B is your sweet spot. 32GB+: try the 26B MoE. LM Studio will let you try to load a model too big for your machine and it will be miserable. Don't.

Second, enable auto-update for LM Studio. The app ships updates roughly every two weeks and each one brings meaningful improvements — inference speed, new model support, UI polish. Settings → Preferences → enable auto-update. Don't fight it.

Third, set up at least one global custom instruction. My default: "Reply concisely. Use bullet points when you have more than two items. Never apologize. Never ask clarifying questions unless absolutely necessary — make a reasonable assumption and state it." Ten minutes of setup, permanent improvement to every chat.

FAQ

Frequently Asked Questions

Everything you need to know about this topic

For the 4B model at Q4_K_M quantization, you need 16GB of RAM, roughly 3GB of free disk space, and any GPU with 6GB+ VRAM (or Apple Silicon). You can run the 2B model on 8GB RAM machines. The 26B MoE wants 32GB RAM minimum. See the "Four Gemma 4 Sizes" section above for full breakdown.

LM Studio is free for personal and commercial use as of April 2026, and Gemma 4 itself is released under Apache 2.0, which explicitly permits commercial deployment. You can legally build products on this stack without paying anything. Check the LM Studio terms of service for edge cases, but the core "use it for work" answer is yes.

Yes, most Gemma 4 variants available through LM Studio are multimodal out of the box — drag and drop an image into the chat window and the model will process it. Confirm the model card mentions "multimodal" or "vision" before downloading, as some community quantizations strip the vision encoder to save space.

Gemma 4 ranks third and sixth on the Arena AI leaderboard among open models with its 31B and 26B MoE variants respectively — directly competitive with Llama and Qwen's top open releases. I personally prefer Gemma 4 for multimodal and chat, and Qwen 3.6 for agentic coding. See the "Real Talk" section for the full nuance.

Yes, completely. Once the model file is downloaded to your machine, LM Studio runs inference entirely locally with zero network calls. You can run it in airplane mode, on a flight, or with your Wi-Fi unplugged. This is the actual point of the whole setup.

Your Next Ten Minutes

If you've read this far, you're already more informed about local AI than most engineers I talk to. But reading about it isn't the point. Installing it is.

Here's the smallest possible commitment that gets you real value: download LM Studio, install Gemma 4 4B Q4_K_M, paste in one real document from your work today, and see what happens. Ten minutes, start to finish. That's it.

You'll know within the first response whether this stack belongs in your daily workflow. I knew during that Wi-Fi outage on Tuesday afternoon — the moment Gemma 4 handed me a clean list of action items without asking me for an API key or an internet connection, the question stopped being "should I try local AI" and became "why did I wait this long."

The cloud isn't going anywhere. Claude and GPT will keep earning their subscription fees for the hardest 30% of my work. But the other 70% — the steady, unglamorous, daily-grind AI tasks that used to quietly drain my API budget — is running on a model that lives on my SSD and costs me nothing per prompt.

Tuesday afternoon, when the Wi-Fi came back, I left Gemma 4 running anyway. That's when I knew the setup had won.

Let's Work Together

Looking to build AI systems, automate workflows, or scale your tech infrastructure? I'd love to help.

Coffee cup

Enjoyed this article?

Your support helps me create more in-depth technical content, open-source tools, and free resources for the developer community.

Related Topics

Engr Mejba Ahmed

About the Author

Engr Mejba Ahmed

Engr. Mejba Ahmed builds AI-powered applications and secure cloud systems for businesses worldwide. With 10+ years shipping production software in Laravel, Python, and AWS, he's helped companies automate workflows, reduce infrastructure costs, and scale without security headaches. He writes about practical AI integration, cloud architecture, and developer productivity.

Discussion

Comments

0

No comments yet

Be the first to share your thoughts

Leave a Comment

Your email won't be published

13  +  15  =  ?

Continue Learning

Related Articles

Browse All

Comments

Leave a Comment

Comments are moderated before appearing.

Learning Resources

Expand Your Knowledge

Accelerate your growth with structured courses, verified certificates, interactive flashcards, and production-ready AI agent skills.

Sample Certificate of Completion

Sample certificate — complete any course to earn yours

Engr Mejba Ahmed

Engr Mejba Ahmed

Claude Code Expert · Online

👋

Hey there!

Quick Actions

WhatsApp Instant reply

Chat on WhatsApp

+880 1723 741224 · Instant reply

Popular Questions

Engr Mejba Ahmed is connected
Engr Mejba Ahmed is typing...
Engr Mejba Ahmed avatar

✉ Want me to follow up? Drop your email

Engr Mejba Ahmed avatar

📞 Connect Directly

Choose how you'd like to reach me

WhatsApp

+880 1723 741224

Email

[email protected]

✓ Details sent! I'll get back to you shortly.

Powered by OpenAI

335+

Blog Posts

25

AI Courses

63

Projects

Services & Expertise

Pricing & Process

Learning & Resources

Connect & Support