18 Claude Code Token Hacks That Saved My Sessions
I watched 98.5% of my tokens vanish before Claude Code even started thinking about my actual question.
That number isn't a typo. A developer's analysis of Claude Code's token consumption revealed that in a long conversation — say, thirty messages deep — nearly all the tokens being charged are spent re-reading old chat history. Not generating new code. Not reasoning about your problem. Just... reading the same conversation back to itself, over and over, growing more expensive with every single message you send.
When I first saw that breakdown, I felt genuinely sick. I'd been blaming Anthropic's rate limits for my sessions dying after twenty minutes. I'd been eyeing the Max20 plan, convinced I needed a bigger quota. Turns out, the problem wasn't my plan size. The problem was me.
Here's what most Claude Code users don't realize: token usage doesn't scale linearly. It compounds. Your first message in a session might cost 500 tokens. By message thirty, that same exchange is costing 15,000 tokens — because Claude re-reads the entire conversation history on every single turn. Add system prompts, MCP server tool definitions, loaded skills, and any files you've pasted in, and you're bleeding tokens from sources you never even see.
The good news? Once I understood the mechanics, I cut my effective token waste by roughly 60% — same plan, same projects, dramatically longer sessions. What follows are the 18 specific techniques that made that possible, organized into three tiers based on how much effort they require and how much impact they deliver.
But first, you need to understand why your sessions are actually dying.
Why Your Claude Code Sessions Die So Fast
The mental model most developers carry is wrong. They think of token usage like a gas tank: you start full, each message uses a fixed amount, and eventually you hit empty. Simple, linear, predictable.
Reality is closer to a snowball rolling downhill.
Every time you send a message, Claude doesn't just process your new input. It re-reads everything — your system prompt, every MCP server's tool definitions, your CLAUDE.md file, the entire conversation history from message one, and then your new prompt. The response gets appended to that history. Next message? The whole thing gets read again, now with the previous response included.
Here's what that looks like in practice:
| Message | Approx. Token Cost (per turn) | Cumulative Session Tokens |
|---|---|---|
| 1 | ~500 | ~500 |
| 10 | ~5,000 | ~27,000 |
| 20 | ~10,000 | ~105,000 |
| 30 | ~15,000 | ~250,000+ |
That thirtieth message costs you thirty times what the first one did. And the cumulative total has blown past a quarter million tokens — most of which were spent on rereading, not reasoning.
There's a second problem hiding inside this one. Researchers call it "loss in the middle" — when the context window gets packed, Claude starts paying less attention to information buried in the middle of the conversation. Your carefully worded instructions from message five? By message twenty-five, they're functionally invisible. The model isn't just expensive at that point. It's actively getting worse.
This is why context hygiene matters more than plan size. A developer on the Pro plan with disciplined token management will outperform a Max20 subscriber who treats conversations like a stream-of-consciousness journal.
Now that you understand the mechanics, let's fix them — starting with the changes you can make in the next five minutes.
Tier 1: The Quick Wins (Implement Today)
These nine techniques require zero setup, minimal habit change, and deliver immediate results. If you do nothing else from this article, do these.
Start Fresh Conversations for Unrelated Tasks
This is the single most impactful habit change on this entire list, and it costs you nothing.
When you finish debugging an authentication flow and pivot to styling a dashboard component, those authentication tokens are still in the conversation. Claude is re-reading your entire auth debugging history on every dashboard styling message. You're paying for context that's actively irrelevant — and potentially confusing the model.
The /clear command exists for exactly this reason. Use it aggressively. I clear my context whenever I shift to a genuinely different task, even if it's in the same project. The five seconds it takes to re-establish context is nothing compared to the token savings of not dragging twenty irrelevant messages through every subsequent turn.
My rule of thumb: if the next task doesn't directly build on the last three messages, /clear first.
Disconnect Unused MCP Servers
This one shocked me when I first ran /context and saw the breakdown. Each connected MCP server loads its entire tool definition schema into the context window on every single message. A Figma MCP, a Slack MCP, a database MCP, and a file system MCP running simultaneously can eat thousands of tokens per turn — before you've typed a single character.
If you're writing code and don't need Figma, disconnect it. If you're designing and don't need your database tools, disconnect them. I keep a minimal set of MCPs active for my current task and reconnect others only when I specifically need them.
The difference is measurable. On one project, disconnecting three idle MCPs dropped my per-turn overhead by roughly 4,000 tokens. Over a thirty-message session, that's 120,000 tokens saved — tokens that went toward actual productive work instead of loading tool schemas I never touched.
Batch Your Prompts Into Single Messages
This is basic arithmetic, but most people miss it. If you need Claude to create a component, add tests for it, and update the import file, that's one message — not three.
Three separate messages means three full context re-reads. One batched message means one re-read for the same amount of work. The savings compound as your conversation gets longer.
I format batched requests like this:
Do these three things in order:
1. Create a UserProfile component in src/components/ with name, email, and avatar props
2. Write tests for it using Vitest — cover the rendering, prop variations, and empty state
3. Update src/components/index.ts to export the new component
Claude handles multi-step instructions well. The key is being specific about the order and the expected output for each step. Vague batches create confusion; precise batches save tokens.
Use Plan Mode Before Complex Tasks
Jumping straight into implementation on a complex feature is one of the most expensive mistakes you can make. Not because the first attempt costs a lot — but because the wrong first attempt triggers a correction cycle that doubles or triples your total token spend.
Plan mode asks Claude to outline its approach before writing code. You review the plan, course-correct if needed, and then give the green light. This front-loads alignment into a single low-cost exchange instead of discovering misalignment six messages deep when the context window is already bloated.
I use plan mode for anything that touches more than two files or involves architectural decisions. For simple single-file changes, I skip it. The judgment call is: "If Claude gets this wrong on the first try, how expensive is the correction?" If the answer is "very," plan first.
Run /context and /cost to See Where Tokens Go
You can't optimize what you can't measure. The /context command — introduced in Claude Code v1.0.86 — breaks down exactly where your tokens are allocated: system prompt, tool definitions, memory files, skills, conversation history, and your actual prompt.
The first time I ran it, I discovered my CLAUDE.md file was consuming 12% of my available context on every single turn. A file I'd written once and forgotten about was silently taxing every interaction. I trimmed it from 400 lines to 120, and the per-turn savings were immediate.
The /cost command shows cumulative API token usage for the session. If you're on an API plan, this tells you your spend in real time. For Max subscribers, it's less about billing and more about understanding how fast you're burning through your usage allocation.
Run both commands at the start of every session. Make it a reflex, like checking your mirrors before driving.
Set Up a Token Usage Status Line
If running /cost manually feels like too much friction, configure your terminal status line to display token usage continuously. You'll see the percentage climbing in real time as you work, which creates a natural feedback loop — you start noticing which types of messages are expensive and which are cheap.
I keep the token percentage visible in my terminal at all times. It's like having a fuel gauge on your dashboard. You don't stare at it constantly, but you glance at it often enough to avoid running dry unexpectedly.
Keep the Dashboard Open
Anthropic's usage dashboard shows your consumption across sessions. Open it in a browser tab and check it a few times during a working day, especially during heavy development sessions. If you're burning through your five-hour allocation faster than expected, you'll catch it early enough to adjust your approach rather than discovering it when the session locks you out.
Only Paste What's Relevant
When you need Claude to understand a file, don't paste the whole thing if only one function matters. I've watched developers paste 800-line files when the relevant section was 40 lines. That's 760 lines of pure waste — loaded into context on every subsequent message.
Be surgical. Copy the specific function, the specific config block, the specific error output. If Claude needs more context, it'll ask. Starting with less is almost always cheaper than starting with everything.
Watch Claude's Output in Real Time
When Claude is generating a long response — building out a large component, writing extensive tests — watch it happen. If you see it heading down the wrong path (wrong framework, wrong file structure, misunderstood requirements), stop it immediately.
Every token Claude generates gets added to the conversation history. A 2,000-token response you didn't want is 2,000 tokens you'll re-read on every future message. Catching a wrong turn after 200 tokens instead of 2,000 saves you on the current message and every message that follows.
I've saved entire sessions this way. One time Claude started generating a REST API when I needed GraphQL resolvers. I caught it within the first function signature and stopped it. Had I walked away and come back to a completed wrong implementation, the correction cycle would have burned through my remaining context budget.
That covers the quick wins. If you've implemented even half of these, you're already ahead of most Claude Code users. But the real efficiency gains come from the structural changes in the next tier — and one of them completely changed how I think about the CLAUDE.md file.
Tier 2: Structural Optimizations (Weekend Project)
These five techniques require some upfront investment — reorganizing files, changing habits, adjusting timing — but they deliver compounding returns over every session that follows.
Keep Your CLAUDE.md Under 200 Lines
I've written about this before in my 50 Claude Code tips guide, but it bears repeating because it's that important. Your CLAUDE.md gets loaded on every single message. It's not a one-time cost — it's a per-turn tax.
Treat CLAUDE.md as an index, not an encyclopedia. It should contain project architecture at a glance, build commands, hard rules, and pointers to longer documentation files. Not the documentation itself.
Here's the mental model that works: your CLAUDE.md is a table of contents. When Claude needs the actual chapter, it can read the file. But loading every chapter into memory on every message — that's the part that kills you.
I restructured mine from a 400-line reference document to a 120-line index that points to detailed docs in a /docs directory. The per-turn token savings were roughly 3,000 tokens. Over a typical 25-message session, that's 75,000 tokens reclaimed for actual work.
Be Surgical With File References
"Look at my codebase and suggest improvements" is the most expensive prompt you can write. It triggers Claude to scan everything — every file, every directory — burning tokens on code that has nothing to do with what you actually want improved.
Instead: "Review the error handling in src/services/payment.ts, specifically the processRefund function on lines 45-80." That's a scalpel. The first prompt is a sledgehammer.
I've made it a habit to always include file paths and, when possible, line numbers or function names in my prompts. The more precisely you direct Claude's attention, the fewer tokens it spends looking in the wrong places.
Compact at 60%, Not 95%
Claude Code has an auto-compact feature that triggers when the context window hits approximately 95% capacity. The /compact command summarizes the conversation history and replaces it with a compressed version, freeing up space.
Here's the problem with waiting until 95%: by that point, the model has been degrading for a while. The "loss in the middle" effect means Claude's output quality drops well before the context window is technically full. And the compaction itself is less effective when there's more to compress — you lose more nuance.
I compact manually at around 60% capacity. Earlier than most people recommend, and that's deliberate. The compaction preserves more relevant detail when there's less to summarize, and the remaining 40% of clean context gives me a solid runway for the next phase of work.
You can also add custom instructions to guide what gets preserved: /compact Focus on the authentication refactoring decisions and the API endpoint signatures. This tells Claude what matters during summarization instead of letting it decide.
Mind the Cache Timeout
This one catches people off guard. Claude Code uses prompt caching — it caches frequently repeated content (system prompts, tool definitions, conversation history) to avoid reprocessing it from scratch. Cached input tokens are significantly cheaper, billed at roughly 10% of the normal rate.
But the cache has a timeout. Take a five-plus minute break — grab coffee, answer a Slack message, get pulled into a meeting — and the cache expires. Your next message triggers a full reprocessing of the entire context at full token cost. A 200,000-token conversation that was being cached efficiently suddenly becomes a 200,000-token cold read.
Two strategies help here. First, if you know you're stepping away for more than a few minutes, /compact before you leave. Smaller context means cheaper reprocessing when you return. Second, if you're coming back from a long break to a bloated conversation, consider /clear and starting fresh with a brief summary of where you left off. It's almost always cheaper than paying for a full cold re-read of a long history.
Control Command Output Bloat
When Claude runs shell commands — npm install, git log, test suites — the full output enters the context window. A verbose test runner dumping hundreds of lines of passing tests? All of it gets stored. A git log that returns fifty commits? Every line becomes context you'll re-read on every future message.
Be deliberate about what commands Claude runs. If you need test results, ask for just the failures: "Run the test suite and only show me failing tests." If you need git history, limit it: "Show me the last 5 commits on this branch." If Claude suggests running a command that'll produce massive output, consider whether you actually need all of it — or just a summary.
I've started adding output restrictions to my CLAUDE.md as a default rule: "When running test suites, suppress passing test output. When checking git history, limit to 10 entries unless specifically asked for more." This prevents token bloat without requiring me to think about it on every command.
These structural changes took me a Saturday afternoon to implement fully. The ROI has been enormous — I'm estimating 40-50% longer sessions on average, and the quality of Claude's responses in the back half of long sessions improved noticeably. The context stays cleaner, so the model stays sharper.
But for users who are pushing Claude Code hard — running multi-agent workflows, building complex systems, or working through peak-hour rate limits — the advanced tier is where the real mastery lives.
Tier 3: Advanced Token Engineering (For Power Users)
These four techniques require a deeper understanding of how Claude Code works under the hood. They're not for everyone. But if you're the type of developer who runs autonomous agent systems or pushes multi-hour sessions daily, this is where the largest gains hide.
Choose the Right Model for Each Task
Not every task needs the most powerful model. Claude Code gives you access to multiple models, and the token economics vary dramatically between them.
Sonnet handles the vast majority of coding tasks — generating components, writing tests, refactoring functions, debugging errors. It's fast, capable, and costs significantly fewer tokens per turn than Opus.
Haiku is perfect for simple, mechanical work: formatting code, renaming variables, generating boilerplate, basic text processing. Using Haiku for these tasks instead of Sonnet is like taking a bicycle for a two-block trip instead of driving.
Opus is the heavy artillery. Deep architectural planning, complex multi-system reasoning, nuanced analysis that requires holding many constraints in mind simultaneously. I use Opus sparingly — maybe 15% of my total Claude Code interactions — and only for tasks where the depth of reasoning genuinely justifies the token premium.
I covered model selection strategy in detail in my AI agent cost optimization guide, but the core principle applies directly here: match the model's capability to the task's requirements. Using Opus to rename a variable is like hiring a surgeon to apply a bandage.
If you'd rather have someone build optimized AI agent systems from scratch, I take on custom automation and integration projects. You can see what I've built at fiverr.com/s/EgxYmWD.
Use Sub-Agents Strategically (Not Liberally)
Sub-agents are powerful because they run in separate context windows. Your main conversation stays clean while the sub-agent handles a focused task and returns a summary. In theory, this is perfect for token management.
In practice, sub-agents are expensive. Each one loads the full context overhead — system prompts, MCP definitions, CLAUDE.md — from scratch. A sub-agent session can consume 7-10x more tokens than handling the same task in your main conversation, depending on the complexity.
The math works in your favor when: the task would add significant bloat to your main context (large file analysis, extensive code generation), the task is cleanly separable, and a summary of the result is sufficient.
The math works against you when: the task is small, the result needs extensive discussion, or you'd need multiple sub-agents for related tasks that share context.
I use sub-agents for research tasks — "analyze this dependency tree and tell me which packages are outdated" — and for code generation that I'll review separately. I avoid them for iterative work where I'd need to go back and forth with the agent multiple times.
Understand Peak vs. Off-Peak Token Economics
According to Anthropic's own documentation, the average Claude Code cost is $6 per developer per day, with 90% of users staying under $12 daily. But that average masks a significant variance based on when you work.
Peak hours — roughly 8 AM to 2 PM Eastern Time on weekdays — coincide with maximum demand on Anthropic's infrastructure. During these windows, rate limiting is more aggressive, context budgets can feel tighter, and heavy sessions get throttled faster.
Off-peak hours — afternoons, evenings, and weekends — offer more headroom. The same plan, the same prompts, but with less contention for resources.
My adjustment was simple: I shifted my heavy multi-agent sessions and large refactoring work to off-peak hours. Quick questions and small tasks happen whenever I need them. But the sessions where I'm burning through tokens aggressively — those happen after 3 PM Eastern or on weekend mornings.
This isn't about getting more tokens. It's about getting more consistent performance from the tokens you have. Peak-hour rate limiting can interrupt flow states and force premature session breaks that waste even more tokens on context rebuilding.
Build a Systems Constitution in Your CLAUDE.md
This is the most sophisticated technique on the list, and it's the one that delivered the best long-term results.
A systems constitution is a section of your CLAUDE.md that captures stable architectural decisions, progress summaries, and operational rules — not as documentation, but as persistent instructions that shape every interaction.
Here's what goes in it:
Architectural decisions that are settled. "This project uses the repository pattern for all database access. Never suggest direct query builders in controllers." This prevents Claude from re-debating decisions you've already made, which saves the back-and-forth tokens that come from correcting suggestions.
Progress markers. "Authentication module: complete and tested. Payment integration: in progress, Stripe webhook handler needs error retry logic." This gives Claude instant project awareness without needing to scan your codebase or ask questions.
Token-saving rules. "Delegate research tasks to sub-agents. Summarize file analysis results in under 100 words before presenting. Never output full file contents when a diff would suffice." These rules compound — they save tokens on every single interaction automatically.
The key principle: save decisions, not conversations. Your constitution should capture the conclusions of previous discussions, not the discussions themselves. "We decided to use Redis for session storage because PostgreSQL was causing latency issues under load" is useful context in one line. The full conversation where you explored that decision? That's fifty lines of context you don't need to carry forward.
I update my systems constitution at the end of every major development session. It takes two minutes and saves me ten minutes of re-establishing context at the start of the next session. Over weeks and months, the compound savings are substantial.
The Mindset Shift That Ties Everything Together
If you've read this far, you might be thinking these 18 techniques feel like a lot of overhead. Tracking token percentages, timing your sessions, restructuring your CLAUDE.md, manually compacting at 60%. Is all of this really necessary?
Here's my honest answer: not all of it. Not all at once.
Start with the Tier 1 basics. /clear between unrelated tasks, disconnect idle MCPs, batch your prompts. These three habits alone will extend your sessions noticeably. Once those feel natural — give it a week — layer in the Tier 2 structural changes. The CLAUDE.md restructuring and manual compaction habit will deliver the next big jump.
Tier 3 is for when you're pushing the tool hard enough that incremental gains matter. Most developers won't need all four advanced techniques. But the model selection strategy and the systems constitution are worth implementing regardless of your usage level.
The overarching insight — the thing I wish someone had told me six months ago — is that hitting token limits isn't a sign that your plan is too small. It's almost always a sign that your context hygiene needs work. The tokens are there. You're just spending them on the wrong things.
Anthropic acknowledged in late March 2026 that users were hitting limits faster than expected, and they've made it their top engineering priority. Infrastructure improvements are coming. But even when quotas increase, these techniques will still matter — because clean context doesn't just save tokens. It produces better output. A model working with 50,000 tokens of focused, relevant context will outperform the same model struggling through 200,000 tokens of accumulated noise.
Think of it this way: token management isn't about being stingy with AI resources. It's about being precise with them. The same way a skilled developer writes clean, focused code instead of bloated spaghetti — not because they're constrained, but because clarity produces better results.
Your sessions will last longer. Your outputs will be sharper. And you'll stop blaming the tool for a problem that was always about the workflow.
What to Do in the Next Ten Minutes
Close this article and open your active Claude Code session. Run /context. Look at the breakdown. I guarantee something in there will surprise you — a bloated CLAUDE.md, three MCP servers you forgot were connected, a conversation history that's 80% irrelevant.
Fix the biggest offender. Just one. Then apply two or three of the Tier 1 techniques during your next working session.
Come back to this article in a week and implement the Tier 2 changes. By that point, you'll have enough firsthand experience with the token mechanics to understand exactly why each structural change matters — because you'll have felt the pain points yourself.
The developers who master Claude Code aren't the ones with the biggest plans. They're the ones who waste the fewest tokens on things that don't matter. That's a skill you can build, starting right now.
FAQ
Frequently Asked Questions
Everything you need to know about this topic
Run /context to see a detailed breakdown of where tokens are allocated — system prompt, tools, memory files, and conversation history. Run /cost to see cumulative API token usage for the current session. Both commands are available in Claude Code v1.0.86 and later.
/clear completely wipes conversation history and starts fresh. /compact summarizes the existing conversation and replaces the full history with a compressed version, preserving key context while freeing tokens. Use /clear when switching tasks entirely; use /compact when continuing the same task but need more headroom.
The "loss in the middle" effect causes Claude to pay less attention to information buried deep in the context window. As conversations grow, earlier instructions and context get pushed into this low-attention zone, reducing output quality. Compacting at 60% capacity — rather than waiting for the 95% auto-trigger — helps maintain response quality throughout the session.
Token costs compound with conversation length. A first message costs roughly 500 tokens, but by message 30, each turn can cost 15,000+ tokens due to full context re-reading. According to Anthropic's data, the average daily cost is $6 per developer, with 90% of users staying under $12.
Yes, significantly. Each connected MCP server loads its full tool definition schema into the context window on every message. Running multiple MCP servers simultaneously can add thousands of tokens per turn. Disconnect any MCP servers you're not actively using to reduce this overhead.
Let's Work Together
Looking to build AI systems, automate workflows, or scale your tech infrastructure? I'd love to help.
- Fiverr (custom builds & integrations): fiverr.com/s/EgxYmWD
- Portfolio: mejba.me
- Ramlit Limited (enterprise solutions): ramlit.com
- ColorPark (design & branding): colorpark.io
- xCyberSecurity (security services): xcybersecurity.io