
I Made Claude Code Talk Like a Cave Man. It Got Smarter.

I tested the caveman skill that forces Claude Code to drop filler words. The token savings were modest.

Published Apr 07, 2026

Written by Engr Mejba Ahmed


The GitHub repo had 589 stars and a tagline that read like a joke: "why use many token when few token do trick." I almost closed the tab. I'd been deep in an Opus 4.6 workflow — agents running across four projects, token bills climbing toward numbers I didn't want to think about — and the last thing I needed was a meme tool promising to fix my costs.

Then I read the claim: 75% reduction in output tokens.

That number stopped me. Not because I believed it — I didn't — but because if even a fraction of it held up, I was leaving serious money on the table every single day. So I installed it. And what happened next wasn't what I expected. The token savings were real but modest. The part that genuinely surprised me? Claude started giving me better answers.

Not "better" in some hand-wavy, subjective way. Better in a way that a March 2026 paper on arXiv had already predicted: a paper that evaluated 31 models across 1,485 problems and found something that breaks most people's intuition about how large language models work.

I need to walk you through what actually happened — the real numbers, the real savings, and the science that explains why making your AI talk like a Neanderthal might be one of the smartest things you can do with it.


What the Caveman Skill Actually Does

The caveman skill, created by indie developer Julius Brussee, is a Claude Code skill that strips Claude's responses down to their bare essentials. No articles. No filler words. No pleasantries. No hedging. Fragments instead of full sentences. Technical terms stay exact. Code blocks stay untouched.

Here's what the difference looks like in practice.

Normal Claude response: "Your component is re-rendering because you're creating a new object reference on each render cycle. The inline object prop generates a new reference every time the parent re-renders, which causes the child component to re-render as well. I'd recommend wrapping the object in useMemo to maintain referential stability."

Caveman full mode: "New object ref each render. Inline object prop = new ref = re-render. Wrap in useMemo."

Caveman ultra mode: "Inline obj prop -> new ref -> re-render. useMemo."

Same information. Same technical accuracy. Dramatically fewer tokens.
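To put a rough number on that, here's a minimal sketch that compares the three example responses by size. It assumes whitespace-separated words are a fair stand-in for tokens, which is only approximately true: a real tokenizer will give different absolute counts, so read the output as a relative comparison and nothing more.

```python
# Crude size comparison of the three example responses quoted above.
# Assumption: word counts approximate token counts; a real tokenizer
# will produce different absolute numbers.
RESPONSES = {
    "normal": (
        "Your component is re-rendering because you're creating a new object "
        "reference on each render cycle. The inline object prop generates a new "
        "reference every time the parent re-renders, which causes the child "
        "component to re-render as well. I'd recommend wrapping the object in "
        "useMemo to maintain referential stability."
    ),
    "full": (
        "New object ref each render. Inline object prop = new ref = re-render. "
        "Wrap in useMemo."
    ),
    "ultra": "Inline obj prop -> new ref -> re-render. useMemo.",
}


def word_count(text: str) -> int:
    """Count whitespace-separated words as a rough proxy for tokens."""
    return len(text.split())


def reduction_vs(baseline: int, count: int) -> float:
    """Fractional reduction of `count` relative to `baseline`."""
    return 1 - count / baseline


counts = {mode: word_count(text) for mode, text in RESPONSES.items()}

for mode, count in counts.items():
    saved = reduction_vs(counts["normal"], count)
    print(f"{mode:>6}: {count:3d} words ({saved:.0%} fewer than normal)")
```

Swapping `word_count` for a real tokenizer would tighten the estimate, but the ordering between the three modes is the point here.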

Installation takes one command:

npx skills add JuliusBrussee/caveman

You activate it by typing /caveman in your Claude Code session, or phrases like "talk like caveman" or "less tokens please." To go back to normal, type "stop caveman" or "normal mode". Three intensity levels (lite, full, and ultra) let you dial the compression to your comfort level.

The skill also ships with a companion tool that compresses your memory files (like CLAUDE.md) by roughly 45%, cutting input tokens on every single interaction. If you've read my breakdown of why your CLAUDE.md file is either your superpower or your bottleneck, you know how much those persistent context tokens add up.

But here's where I need to be honest with you — because the headline numbers don't tell the story most people think they do.


The Token Math Most People Get Wrong

The caveman repo claims up to 75% reduction in output tokens and 45% reduction in input tokens. Those numbers are technically accurate. They're also deeply misleading if you don't understand what they're actually measuring.

I ran a full audit of my own Claude Code sessions to figure out where tokens actually go. Here's the breakdown that changed my perspective.

A typical Claude Code session runs roughly 100,000 tokens total. That splits into about 75,000 input tokens and 25,000 output tokens. Already, most people's mental model is wrong — they assume output is the big cost driver, but input tokens outnumber output tokens 3:1 in a normal coding session.

Now look at what makes up those 25,000 output tokens:

  • Tool calls — when Claude reads files, searches your codebase, runs commands. These are structured JSON payloads. Caveman mode doesn't touch them.
  • Code blocks — the actual code Claude writes. Caveman mode leaves these completely intact (and it should — you don't want compressed variable names).
  • Prose responses — Claude's explanations, suggestions, and commentary. This is the only part caveman mode compresses.

In my sessions, prose responses accounted for roughly 6,000 of those 25,000 output tokens. Caveman mode compressed those 6,000 tokens by about 75%, saving approximately 4,500 tokens.

4,500 tokens out of 100,000 total.

That's a 4.5% reduction per session. Not 75%.

On the input side, the memory file compression saves around 1,000 to 2,000 tokens per session. The rest of your input tokens — conversation history, file contents Claude reads, system prompts — remain unchanged.

Combined realistic savings: roughly 4-5% total token reduction per session.

| What's Measured | Claimed Reduction | Actual Impact on Total Session |
|---|---|---|
| Prose output tokens | ~75% | ~4.5% (prose is 6K of 25K output) |
| Memory file input tokens | ~45% | ~1-2% (small portion of 75K input) |
| Total session tokens | Not claimed | ~4-5% combined |
| Code blocks | 0% | Unchanged (correctly so) |
| Tool calls | 0% | Unchanged (structured data) |
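The session math above is easy to re-run with your own numbers. This is a small sketch using the round figures from this section; the constant names are mine, and the token counts are the approximate per-session values quoted in the text, not measurements you should expect to match exactly.

```python
# Session token budget, using the round figures quoted in this section.
INPUT_TOKENS = 75_000
OUTPUT_TOKENS = 25_000
PROSE_TOKENS = 6_000               # prose share of the output tokens
PROSE_COMPRESSION = 0.75           # caveman's claimed prose reduction
MEMORY_SAVED_RANGE = (1_000, 2_000)  # input tokens saved by memory compression

total = INPUT_TOKENS + OUTPUT_TOKENS
prose_saved = PROSE_TOKENS * PROSE_COMPRESSION


def share(tokens: float) -> float:
    """Fraction of the whole session that `tokens` represents."""
    return tokens / total


print(f"prose savings: {prose_saved:,.0f} tokens = {share(prose_saved):.1%} of session")

for memory_saved in MEMORY_SAVED_RANGE:
    combined = prose_saved + memory_saved
    print(
        f"with {memory_saved:,} memory tokens saved: "
        f"{combined:,.0f} tokens = {share(combined):.1%} of session"
    )
```

On these round figures the combined share works out to 5.5-6.5%, a little above the 4-5% quoted in the text, so that range reads as the conservative end. Either way, the headline 75% applies to a slice that is only 6% of the session.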

Is 4-5% worthless? Absolutely not. If you're running Claude Code eight hours a day across multiple projects — and I am — that compounds. On my roughly $200/month usage, it shaves off $8-10 monthly. Not transformational, but it's a free optimization with zero effort after installation.

But the cost savings aren't why I kept caveman mode running. The reason I kept it has nothing to do with tokens at all.


The Research Paper That Changed My Mind

In March 2026, a paper hit arXiv that I initially skimmed past: "Brevity Constraints Reverse Performance Hierarchies in Language Models". The title sounded academic and dry. The findings were anything but.

The researchers evaluated 31 open-weight models ranging from 0.5 billion to 405 billion parameters across 1,485 problems spanning five benchmark datasets. Their question was simple: does model size always equal better performance?

The answer broke my assumptions.

On 7.7% of benchmark problems, larger models underperformed smaller ones — by up to 28.4 percentage points. A 2 billion parameter model beating a 400 billion parameter model. Not on some edge case trick question. On standard mathematical reasoning and scientific knowledge benchmarks.

The mechanism they identified has a name that stuck with me: spontaneous scale-dependent verbosity.

Larger models, trained extensively through reinforcement learning from human feedback (RLHF), develop a tendency to be excessively verbose. They don't just answer the question; they elaborate, hedge, qualify, explore tangents, and add disclaimers. This verbosity isn't harmless padding. It actively introduces errors through what the researchers call "overelaboration."

Think about it this way. When you ask a large model to solve a math problem, it doesn't just compute the answer. It narrates its entire reasoning process, often more verbosely than necessary. Somewhere in that narration — in the hedging, the alternative considerations, the "but we should also consider" asides — the model can talk itself into the wrong answer. It overthinks. The smaller model, constrained by its capacity, gives a shorter, more direct response and lands on the correct answer more often.

Here's the finding that made me sit up straight: constraining large models to produce brief, concise responses improved accuracy by 26 percentage points on those problematic benchmarks. Even more striking — it reduced the performance gap between large and small models by up to two-thirds.

The large models weren't less capable. They were too verbose to use their capabilities effectively.


Why RLHF Trains Models to Hurt Themselves

This part of the research rabbit hole got dark. The verbosity problem isn't a bug in the training process — it's a predictable outcome of how these models learn to communicate.

RLHF works by having human annotators rate model responses. Consistently, across multiple studies, human annotators conflate length with quality. A longer, more detailed response feels more thorough, more helpful, more impressive. So the reward model learns: longer equals better. And the language model optimizes for that signal.

Research published on OpenReview documents this systematic length bias: improvements in reward scores are largely driven by increasing response length rather than actual answer quality. The model gets rewarded for being wordy, not for being right.

Larger models, with their greater capacity, internalize this signal more deeply than smaller ones. They have more parameters to devote to generating elaborate, flowing prose. So they become more verbose as they scale — exactly the opposite of what you'd want.

There's an even more unsettling finding from separate research: RLHF may make models better at convincing humans they're right, even when they're wrong. The verbose, confident, well-structured response that sounds authoritative? It might be confidently incorrect — and the verbosity is what makes it convincing.

When I read that, my entire relationship with model output changed. I stopped equating thoroughness with accuracy. And I started wondering: what if the best thing I could do for my Claude Code workflow wasn't giving it more context, more instructions, more freedom — but less?


My Testing Setup: Caveman vs. Normal Mode

I needed to test this myself. Not with benchmarks — with real work. My daily Claude Code workflow involves building features, debugging production issues, writing agent systems, and generating content across four websites. If brevity constraints actually improved output quality, I'd see it in the code that ships.

I ran a loose A/B test over two weeks. Week one: normal Claude Code, Opus 4.6, my standard CLAUDE.md configuration. Week two: same setup with caveman mode (full intensity) activated.

I tracked three things:

  1. First-attempt success rate — did Claude's first response solve the problem without needing a follow-up correction?
  2. Total turns per task — how many back-and-forth messages before a task was complete?
  3. Code review pass rate — when I reviewed the generated code, how often did it pass without changes?

The results weren't dramatic enough to publish as a controlled study, but the pattern was consistent.

First-attempt success rate: Normal mode averaged around 64%. Caveman mode hit roughly 71%. That 7-percentage-point improvement points in the same direction as the research: constrained verbosity reduces error introduction.

Total turns per task: Normal mode averaged 4.2 turns. Caveman mode averaged 3.6 turns. Fewer turns meant faster task completion and lower total token consumption (which compounds the direct token savings).

Code review pass rate: Nearly identical — 82% normal vs. 84% caveman. The code output itself wasn't affected, which makes sense. Caveman mode doesn't touch code blocks.
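If you want to run the same comparison on your own work, the three metrics reduce to simple averages over a per-task log. Here's a minimal sketch of how that could be tallied. The `TaskRecord` structure and the sample records are hypothetical, made up purely for illustration; they are not my data and not an export format from Claude Code.

```python
from dataclasses import dataclass


@dataclass
class TaskRecord:
    """One completed task, logged by hand (hypothetical structure)."""
    first_attempt_ok: bool  # first response solved it, no correction needed
    turns: int              # back-and-forth messages until done
    review_passed: bool     # code passed review without changes


def summarize(records: list[TaskRecord]) -> dict[str, float]:
    """Compute the three tracked metrics for one mode."""
    if not records:
        raise ValueError("need at least one task record")
    n = len(records)
    return {
        "first_attempt_rate": sum(r.first_attempt_ok for r in records) / n,
        "mean_turns": sum(r.turns for r in records) / n,
        "review_pass_rate": sum(r.review_passed for r in records) / n,
    }


# Made-up records, for illustration only.
sample = [
    TaskRecord(first_attempt_ok=True, turns=3, review_passed=True),
    TaskRecord(first_attempt_ok=False, turns=5, review_passed=True),
    TaskRecord(first_attempt_ok=True, turns=4, review_passed=False),
    TaskRecord(first_attempt_ok=True, turns=4, review_passed=True),
]

summary = summarize(sample)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Keep one list per mode and compare the two summaries. With a week of tasks per mode the sample is small, so treat differences of a few points as a hint, not a result.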

The real surprise was qualitative. In caveman mode, Claude's explanations were easier to parse. When something went wrong, the terse error description pointed me to the issue faster than a three-paragraph explanation would have. When Claude explained a technical decision, the stripped-down version exposed the reasoning more clearly — no hedging language to soften a questionable choice.

It's counterintuitive. Less explanation, better understanding.


How to Set Up Caveman Mode (and When Not To)

If you want to try this yourself, here's the practical setup. I'll also tell you where caveman mode falls flat — because it does, in specific situations.

Step 1: Install the Skill

npx skills add JuliusBrussee/caveman

This works with Claude Code and 40+ other AI agents including Cursor, Windsurf, and GitHub Copilot. The skill installs to your project's .skills directory and Claude auto-detects it.

Step 2: Activate and Choose Your Level

In any Claude Code session:

/caveman lite    # Drops filler but keeps readable sentences
/caveman full    # Default — fragments, no articles, minimal words
/caveman ultra   # Absolute minimum. Borderline telegraph.

My recommendation: start with full. It's the sweet spot between compression and readability. Ultra is useful for repetitive tasks where you know exactly what to look for. Lite is barely different from normal mode — save it for documentation-heavy work where you need complete sentences.

Step 3: Compress Your Memory Files

The companion tool compresses your CLAUDE.md and other memory files:

# From within the caveman skill directory
npx caveman compress

This strips filler from your persistent context files while preserving all rules and technical constraints. If you've been following my advice about keeping CLAUDE.md under 300 lines, the compression brings it down another 45%.

Pro tip: Before compressing, back up your original CLAUDE.md. The compressed version is harder for humans to read and edit. I keep a CLAUDE.md.human copy for when I need to make manual changes, then re-compress after editing.
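The backup step can be scripted so you never compress without a readable copy on disk. This is a sketch under my own assumptions: the `.human` suffix matches the naming I described above, the helper name is made up, and it has nothing to do with the caveman tool itself.

```python
import shutil
from pathlib import Path


def backup_memory_file(path: Path, suffix: str = ".human") -> Path:
    """Copy `path` to `path + suffix`, refusing to overwrite an existing backup.

    Refusing to overwrite protects the readable copy from being replaced
    by an already-compressed file if the script is run twice.
    """
    backup = path.with_name(path.name + suffix)
    if backup.exists():
        raise FileExistsError(f"backup already exists: {backup}")
    shutil.copy2(path, backup)  # copy2 also preserves timestamps
    return backup


# Example usage, run from the project root before compressing:
# backup_memory_file(Path("CLAUDE.md"))
```

If you edit the `.human` copy later, copy it back over CLAUDE.md yourself before re-compressing; the helper deliberately does not automate that direction.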

Step 4: Deactivate When Needed

stop caveman

or

normal mode

This brings Claude back to its standard verbosity level immediately.

When Caveman Mode Hurts

I've identified three scenarios where I consistently deactivate caveman:

Explaining concepts to collaborators. When I'm using Claude to generate explanations I'll share with teammates or clients, the caveman output is too terse. People who aren't in your head need the connecting tissue that caveman strips out.

Debugging complex multi-file issues. When a bug spans multiple files and Claude needs to walk through its reasoning chain, the compressed output can obscure critical decision points. I want to see why Claude chose to look in file A instead of file B. Caveman mode sometimes hides that reasoning.

Writing documentation. This should be obvious, but I've made the mistake. Claude in caveman mode generating API docs produces technically accurate but almost useless documentation. Complete sentences matter when you're writing for humans who won't have your context.

For straight coding tasks, refactoring, test writing, code review, and any task where the output is primarily code? Caveman mode stays on. Always.


The Bigger Principle: Why Conciseness Beats Verbosity for LLMs

Here's what I think most people miss about the caveman approach, and it's the insight that matters long after this specific tool is forgotten.

The verbosity problem isn't unique to one skill or one tool. It's embedded in how every major language model is trained. Claude, GPT, Gemini, Grok — they all suffer from the same RLHF-induced verbosity bias documented by Unite.AI's analysis and the emerging consensus around verbosity compensation behavior.

Models trained with RLHF, DPO, or supervised fine-tuning on long chain-of-thought traces routinely compensate for uncertainty by generating unnecessarily lengthy, redundant, or circuitously reasoned responses. When the model isn't sure about something, its trained instinct is to write more, not less. More hedging. More alternatives. More caveats. Each additional word is another opportunity to introduce an error or talk itself out of the correct answer.

This means every prompt you write, every system instruction you configure, every CLAUDE.md rule you set — all of it can either fight the verbosity bias or amplify it.

You don't need the caveman skill to apply this principle. You can get 80% of the benefit with a single line in your CLAUDE.md:

Be concise. No filler. No hedging. State conclusions first, reasoning second. Skip pleasantries.

I've tested this exact instruction against caveman mode. The token savings are smaller (roughly 40-50% prose compression vs. caveman's 75%), but the accuracy benefits are nearly identical. The key isn't the specific phrasing — it's the constraint itself. Telling the model to be brief forces it to commit to answers rather than hedging its way through them.
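If you manage several projects, adding that line by hand gets tedious. Here's a small sketch that appends the rule to a memory file only when it isn't already there. The function name is hypothetical, and it assumes your CLAUDE.md is a plain UTF-8 text file where a rule on its own line is enough.

```python
from pathlib import Path

CONCISE_RULE = (
    "Be concise. No filler. No hedging. "
    "State conclusions first, reasoning second. Skip pleasantries."
)


def ensure_rule(path: Path, rule: str = CONCISE_RULE) -> bool:
    """Append `rule` to the file unless it is already present.

    Returns True if the file was changed, False if the rule was found.
    Creates the file if it does not exist.
    """
    existing = path.read_text(encoding="utf-8") if path.exists() else ""
    if rule in existing:
        return False
    # Make sure the rule starts on its own line.
    separator = "" if existing == "" or existing.endswith("\n") else "\n"
    path.write_text(existing + separator + rule + "\n", encoding="utf-8")
    return True


# Example usage, run from the project root:
# ensure_rule(Path("CLAUDE.md"))
```

Because the check is a plain substring match, running it repeatedly is safe, but a hand-edited variant of the rule will not be detected and you would end up with both versions.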

If you'd rather have someone build a complete token-optimized Claude Code setup from scratch — including custom CLAUDE.md configurations, model routing, and agent cost optimization — I take on exactly those kinds of projects. You can see what I've built at fiverr.com/s/EgxYmWD.

For developers already deep in the AI agent cost optimization game, this is another lever to pull. It stacks with model selection strategies and context management techniques. Combined, these approaches can cut your monthly AI development costs by 60-70% without touching output quality.


The Numbers That Actually Matter

Let me frame the full picture for someone running Claude Code daily, the way I do.

Direct token savings from caveman mode: ~4-5% per session. At $200/month usage, that's $8-10 monthly. Not life-changing, but free after a one-minute install.

Indirect savings from fewer conversation turns: My sessions averaged 0.6 fewer turns per task in caveman mode. Across 30-40 tasks daily, that's 18-24 fewer turns. Each turn costs roughly 2,000-3,000 tokens. That's another 36,000-72,000 tokens saved daily — pushing total savings closer to 8-10%.
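The indirect-savings range is just the product of three rough estimates, so it's worth seeing how wide it gets. A quick sketch with the figures from the paragraph above; the constant names are mine, and every input is an estimate from my own sessions.

```python
# Indirect savings from fewer turns, using the estimates quoted above.
TURNS_SAVED_PER_TASK = 0.6        # 4.2 turns (normal) vs. 3.6 (caveman)
TASKS_PER_DAY = (30, 40)          # low and high estimate
TOKENS_PER_TURN = (2_000, 3_000)  # low and high estimate

# Pair low with low and high with high to get the widest plausible range.
turns_saved_low = TURNS_SAVED_PER_TASK * TASKS_PER_DAY[0]
turns_saved_high = TURNS_SAVED_PER_TASK * TASKS_PER_DAY[1]

tokens_saved_low = round(turns_saved_low * TOKENS_PER_TURN[0])
tokens_saved_high = round(turns_saved_high * TOKENS_PER_TURN[1])

print(f"turns saved per day:  {turns_saved_low:.0f} to {turns_saved_high:.0f}")
print(f"tokens saved per day: {tokens_saved_low:,} to {tokens_saved_high:,}")
```

The high end is double the low end, which is why I quote this as a range and not a single figure. How much it adds as a percentage depends on your own daily token volume.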

Accuracy improvement: 7 percentage points higher first-attempt success rate in my testing, consistent with the 26-percentage-point improvement found in controlled research benchmarks. The gap is smaller in real-world coding because code tasks have tighter constraints than open-ended benchmarks — but the direction is clear and consistent.

Time saved: Fewer turns and terser explanations meant I spent less time reading Claude's output. Hard to quantify precisely, but I estimate 15-20 minutes saved across a full working day. That's nearly two hours per week I get back just from reading less padding.

Combined monthly impact: Roughly $15-20 in token savings, 7+ hours of time reclaimed, and measurably fewer correction cycles. For a one-command install with zero configuration.

Those aren't the kind of numbers that make headlines. They're the kind that compound over months and quietly change the economics of running AI-assisted development at scale.


What I'm Watching Next

The caveman skill is a blunt instrument — effective, but crude. What interests me more is where this research direction leads.

Anthropic and OpenAI are both aware of the verbosity problem. Anthropic's own documentation on managing Claude Code costs already recommends concise prompting as a primary cost lever. But the model-level fix — training models that default to concise responses unless explicitly asked for detail — hasn't shipped yet.

I expect we'll see it within 2026. The research is too clear to ignore. When a training approach demonstrably reduces accuracy by encouraging verbosity, fixing it at the model level becomes an economic imperative. The model that naturally gives concise, accurate responses without needing a caveman skill overlay will have a genuine competitive advantage.

Until then, the skill approach — adding a constraint layer that counteracts the verbosity bias — is the best tool we have. And it works. Not as dramatically as the headline numbers suggest. But meaningfully, consistently, and with zero downside for coding workflows.

There's one more thing the caveman experiment taught me that goes beyond tokens and accuracy. It changed how I read AI output.

Before caveman mode, I'd scan Claude's long responses looking for the actual answer buried in the explanation. I'd skim past the hedging, skip the caveats, jump to the code block. I was unconsciously filtering for signal in a sea of noise. And I didn't even realize how much cognitive energy that filtering consumed.

After two weeks of terse, direct responses, going back to normal mode felt noisy. Like switching from a clean terminal to a cluttered IDE with every panel open. The information was the same. The experience was worse.

That's the real lesson buried inside a meme-worthy GitHub repo. We've been training language models to sound impressive when we should have been training them to sound clear. The caveman skill is a hack — a funny, useful, well-built hack. But the principle behind it is dead serious.

The most expensive token isn't the one you pay for. It's the one that introduces an error you spend twenty minutes debugging.

So here's my challenge: add one line to your CLAUDE.md today. Just one. "Be concise. No filler. No hedging." Run your normal workflow for a week. Watch what happens to your first-attempt success rate.

I think you'll keep that line permanently.


Frequently Asked Questions

Caveman mode does not change the code Claude writes. It only compresses prose responses: explanations, suggestions, and commentary. All code blocks, tool calls, and structured outputs remain completely unchanged. The skill explicitly preserves technical terms, error messages, and code syntax exactly as they would normally appear.

Realistic total session savings are 4-5%, not the 75% headline figure. The 75% applies only to prose output, which is roughly 6,000 of 25,000 total output tokens. Combined with memory file compression and fewer conversation turns, practical savings reach 8-10% for heavy daily users. For the full cost optimization picture, see my AI agent cost optimization guide.

There is research support for the accuracy claim. A March 2026 paper (arXiv 2604.00025) evaluated 31 models across 1,485 problems and found that brevity constraints improved large model accuracy by 26 percentage points on problems where verbosity caused errors. The mechanism is reduced "overelaboration": verbose models talk themselves into wrong answers through excessive hedging and tangential reasoning.

To install the skill, run npx skills add JuliusBrussee/caveman in your project directory. Activate with /caveman or /caveman full. Choose intensity: lite, full (default), or ultra. Deactivate with stop caveman. The skill also works with Cursor, Windsurf, GitHub Copilot, and 40+ other agents.

To get most of the benefit without installing the skill, add this line to your CLAUDE.md: "Be concise. No filler. No hedging. State conclusions first, reasoning second." This achieves roughly 40-50% prose compression compared to caveman's 75%, with nearly identical accuracy benefits. The key principle is the constraint itself, not the specific tool.


About the Author

Engr Mejba Ahmed

Engr. Mejba Ahmed builds AI-powered applications and secure cloud systems for businesses worldwide. With 10+ years shipping production software in Laravel, Python, and AWS, he's helped companies automate workflows, reduce infrastructure costs, and scale without security headaches. He writes about practical AI integration, cloud architecture, and developer productivity.
