Token Management and Cost Optimization
Tokens are the unit of cost and context. Managing them well is the difference between a profitable product and one that burns budget.
Counting Tokens Before Sending
Use the token counting API to preview cost before making expensive calls:
import anthropic
client = anthropic.Anthropic()
messages = [
    {"role": "user", "content": "Analyze this 10,000 word document and summarize..."}
]
# Count tokens without consuming API budget
token_count = client.messages.count_tokens(
    model="claude-sonnet-4-5",
    messages=messages,
)
print(f"This request will use {token_count.input_tokens} input tokens")
# Reject if too expensive
MAX_INPUT_TOKENS = 50_000
if token_count.input_tokens > MAX_INPUT_TOKENS:
    raise ValueError(f"Request too large: {token_count.input_tokens} tokens")
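If the real request will include a system prompt or tool definitions, count those too: count_tokens accepts the same system and tools parameters as messages.create, so the preview can mirror exactly what you will send. A minimal sketch (the system prompt text here is illustrative):

# Mirror the real request so the preview matches what you will actually send
token_count = client.messages.count_tokens(
    model="claude-sonnet-4-5",
    system="You are a concise document summarizer.",  # illustrative system prompt
    messages=messages,
)
print(f"With system prompt: {token_count.input_tokens} input tokens")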
Tracking Usage Per Request
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=2048,
    messages=messages,
)
# Log usage for cost tracking
usage = response.usage
print(f"Input: {usage.input_tokens} tokens")
print(f"Output: {usage.output_tokens} tokens")
print(f"Cache read: {getattr(usage, 'cache_read_input_tokens', 0)} tokens")
print(f"Cache write: {getattr(usage, 'cache_creation_input_tokens', 0)} tokens")
Prompt Caching
For repeated system prompts or large documents, prompt caching can cut input costs by up to 90% on the cached portion, since cache reads are billed at a steep discount:
# `large_style_guide` stands in for a long reference document loaded elsewhere
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are an expert code reviewer. " + large_style_guide,
            "cache_control": {"type": "ephemeral"},  # cache the prompt up to this block
        }
    ],
    messages=[{"role": "user", "content": "Review this pull request: ..."}],
)
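The usage fields from the previous section let you confirm caching is working: the first call reports the cached prefix under cache_creation_input_tokens, and repeat calls within the cache lifetime report it under cache_read_input_tokens instead. A sketch, reusing the response above:

usage = response.usage
cache_read = getattr(usage, "cache_read_input_tokens", 0) or 0
cache_write = getattr(usage, "cache_creation_input_tokens", 0) or 0
if cache_read:
    print(f"Cache hit: {cache_read} tokens read at the discounted rate")
elif cache_write:
    print(f"Cache write: {cache_write} tokens stored for reuse")
else:
    print("No caching occurred (prompt may be below the minimum cacheable length)")

Note that cache writes are billed at a premium over the base input rate while reads are heavily discounted, so caching pays off only when the same prefix is actually reused; very short prompts fall below the minimum cacheable length and are not cached at all.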
JavaScript Token Tracking
import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic();
const response = await client.messages.create({
  model: "claude-sonnet-4-5",
  max_tokens: 1024,
  messages: [{ role: "user", content: "Explain recursion." }],
});
const { input_tokens, output_tokens } = response.usage;
const COST_PER_INPUT_MTK = 3.0; // $3 per million tokens (Sonnet)
const COST_PER_OUTPUT_MTK = 15.0;
const cost =
  (input_tokens / 1_000_000) * COST_PER_INPUT_MTK +
  (output_tokens / 1_000_000) * COST_PER_OUTPUT_MTK;
console.log(`Cost: $${cost.toFixed(6)}`);
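This estimate ignores cache tokens: if you use prompt caching, cache writes and cache reads are billed at different rates than regular input tokens, so extend the formula accordingly. Prices also vary by model, so for anything beyond a quick estimate, look them up rather than hardcoding them.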
Context Window Limits
| Model | Context Window |
|---|---|
| Claude Haiku | 200K tokens |
| Claude Sonnet | 200K tokens |
| Claude Opus | 200K tokens |
200K tokens is roughly 150,000 words of English text, but sending that much with every request is expensive. Truncate conversation history or use RAG (Chapter 5) to keep context lean; a truncation sketch follows.
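A simple truncation strategy is to drop the oldest turns until the history fits a token budget, reusing count_tokens from earlier to measure it. A sketch (the helper name and budget are illustrative, and it assumes the history alternates user/assistant turns starting with a user message):

def truncate_history(client, messages, model="claude-sonnet-4-5", budget=50_000):
    """Drop the oldest turns until the conversation fits the token budget."""
    history = list(messages)
    # Always keep at least the most recent exchange
    while len(history) > 2:
        count = client.messages.count_tokens(model=model, messages=history)
        if count.input_tokens <= budget:
            break
        # Remove the oldest user/assistant pair to preserve role alternation
        del history[:2]
    return history

Each loop iteration is a network round trip (count_tokens calls are free but not instant), so for very long histories a rough local estimate before the final count is cheaper.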