Token Management and Cost Optimization

Tokens are the unit of both cost and context: every request is billed per input and output token, and every model has a fixed context window. Managing them well is the difference between a profitable product and one that quietly burns budget.

Counting Tokens Before Sending

Use the token counting API to preview cost before making expensive calls:

import anthropic

client = anthropic.Anthropic()

messages = [
    {"role": "user", "content": "Analyze this 10,000 word document and summarize..."}
]

# Count tokens without consuming API budget
token_count = client.messages.count_tokens(
    model="claude-sonnet-4-5",
    messages=messages,
)

print(f"This request will use {token_count.input_tokens} input tokens")

# Reject if too expensive
MAX_INPUT_TOKENS = 50_000
if token_count.input_tokens > MAX_INPUT_TOKENS:
    raise ValueError(f"Request too large: {token_count.input_tokens} tokens")

Tracking Usage Per Request

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=2048,
    messages=messages,
)

# Log usage for cost tracking
usage = response.usage
print(f"Input:  {usage.input_tokens} tokens")
print(f"Output: {usage.output_tokens} tokens")
print(f"Cache read: {getattr(usage, 'cache_read_input_tokens', 0)} tokens")
print(f"Cache write: {getattr(usage, 'cache_creation_input_tokens', 0)} tokens")

Prompt Caching

For repeated system prompts or large documents, prompt caching can reduce input costs by up to 90%: cached tokens are read back at roughly a tenth of the base input rate, in exchange for a small one-time premium when the cache is first written:

# large_style_guide is assumed to be a long string defined elsewhere; the prefix
# must meet the minimum cacheable length (1024 tokens on Sonnet) to be cached
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are an expert code reviewer. " + large_style_guide,
            "cache_control": {"type": "ephemeral"},  # cache everything up to this point
        }
    ],
    messages=[{"role": "user", "content": "Review this pull request: ..."}],
)
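
The usage fields from the previous section confirm whether caching is actually working. A minimal check, assuming system_blocks holds the cached system list from the example above and both calls happen within the cache's lifetime: the first call should report tokens under cache_creation_input_tokens, and the repeat call should report them under cache_read_input_tokens instead.

# Hypothetical check: system_blocks is the cached system list from the example above
first = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=system_blocks,
    messages=[{"role": "user", "content": "Review this pull request: ..."}],
)
print(f"Cache write: {getattr(first.usage, 'cache_creation_input_tokens', 0)} tokens")

# A second call with the identical prefix, within the cache TTL, should hit the cache
second = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=system_blocks,
    messages=[{"role": "user", "content": "Review this pull request: ..."}],
)
print(f"Cache read: {getattr(second.usage, 'cache_read_input_tokens', 0)} tokens")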

JavaScript Token Tracking

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const response = await client.messages.create({
  model: "claude-sonnet-4-5",
  max_tokens: 1024,
  messages: [{ role: "user", content: "Explain recursion." }],
});

const { input_tokens, output_tokens } = response.usage;
const COST_PER_INPUT_MTK = 3.0; // $3 per million tokens (Sonnet)
const COST_PER_OUTPUT_MTK = 15.0;

const cost =
  (input_tokens / 1_000_000) * COST_PER_INPUT_MTK +
  (output_tokens / 1_000_000) * COST_PER_OUTPUT_MTK;

console.log(`Cost: $${cost.toFixed(6)}`);

Context Window Limits

Model            Context Window
Claude Haiku     200K tokens
Claude Sonnet    200K tokens
Claude Opus      200K tokens

200K tokens is roughly 150,000 English words, but sending that much in every request is expensive and slow. Truncate conversation history (a sketch follows below) or retrieve only the relevant context with RAG (Chapter 5) to keep prompts lean.
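
A minimal truncation sketch, assuming the conversation is a list of message dicts ordered oldest-first and alternating user/assistant turns: drop the oldest pair of turns until the history fits a token budget measured with count_tokens. The budget value and loop structure here are illustrative, not a prescribed recipe.

def truncate_history(client, messages, model="claude-sonnet-4-5", budget=50_000):
    """Drop the oldest user/assistant pairs until the history fits the budget."""
    history = list(messages)
    # Always keep at least the most recent exchange
    while len(history) > 2:
        # count_tokens previews the size without billing input tokens
        # (though it is still a network call)
        count = client.messages.count_tokens(model=model, messages=history)
        if count.input_tokens <= budget:
            break
        history = history[2:]  # drop the oldest user/assistant pair
    return history

trimmed = truncate_history(client, messages)
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=2048,
    messages=trimmed,
)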