The whiteboard analogy
Imagine a small whiteboard. You can fit a few sentences before you run out of room. Erase older ideas to make space, or buy a bigger board (it costs more).
A context window is the model's whiteboard. Tokens are the marker strokes — chunks of text, usually 3–4 characters each. Every prompt, every response, every system instruction has to fit on the board at once.
Run past the edge and the model loses the start. Buy more board (a model with a larger context window) and you pay per stroke.
What a token actually is
A token is a sub-word unit produced by the model's tokenizer. Examples in English (roughly):
- `hello` → 1 token
- `running` → 1–2 tokens
- `unbelievability` → 4–5 tokens
- a Chinese character → often 2–3 tokens
Rule of thumb: 1 token ≈ 4 characters ≈ 0.75 words in English. Code and non-English text use more tokens per character.
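The rule of thumb can be turned into a quick estimator. This is a minimal sketch that assumes English prose; the function and constant names are illustrative, and a real tokenizer will give different counts, especially for code and non-English text.

```python
import math

CHARS_PER_TOKEN = 4      # rule of thumb for English text
WORDS_PER_TOKEN = 0.75   # rule of thumb for English text


def estimate_tokens_from_chars(text: str) -> int:
    """Rough token estimate from character count (English prose only)."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_tokens_from_words(text: str) -> int:
    """Rough token estimate from the whitespace-separated word count."""
    return math.ceil(len(text.split()) / WORDS_PER_TOKEN)


sample = "The context window is the model's whiteboard."
print(estimate_tokens_from_chars(sample))
print(estimate_tokens_from_words(sample))
```

The two estimates will rarely agree exactly. Treat them as a budget check, and use the provider's own tokenizer when you need exact counts.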
The context window
The context window is the maximum number of tokens the model can attend to in one call. Typical values:
| Model class | Context window |
|---|---|
| Older small models | 4k–8k tokens |
| Mid-tier production | 32k–128k tokens |
| Long-context flagships | 200k–1M+ tokens |
Everything you send — system prompt, history, retrieved docs, user message — plus everything you receive must fit inside.
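One way to picture this constraint is as a budget check before each call. The sketch below assumes you already have token counts for each part of the request; `ContextBudget` and the numbers are hypothetical and not tied to any particular provider.

```python
from dataclasses import dataclass


@dataclass
class ContextBudget:
    context_window: int      # total tokens the model can attend to in one call
    max_output_tokens: int   # room reserved for the response

    def input_limit(self) -> int:
        """Tokens left for everything you send once output room is reserved."""
        return self.context_window - self.max_output_tokens

    def fits(self, *part_token_counts: int) -> bool:
        """True if the summed input parts stay within the input limit."""
        return sum(part_token_counts) <= self.input_limit()


budget = ContextBudget(context_window=8_000, max_output_tokens=1_000)

# Hypothetical counts: system prompt, history, retrieved docs, user message.
print(budget.fits(2_000, 3_500, 1_000, 200))   # 6,700 input tokens vs. a 7,000 limit
print(budget.fits(2_000, 3_500, 2_000, 200))   # 7,700 input tokens vs. a 7,000 limit
```

Reserving the output room up front keeps a long prompt from leaving the model with no space to answer.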
Why cost climbs faster than you expect
Most APIs price input tokens and output tokens separately, and output tokens usually cost 3–5× more than input tokens. Two traps:
- Conversation drift: every turn re-sends the entire history. In a 50-turn chat at 500 tokens/turn, the final call alone ships about 25k input tokens, and the running total across all calls is far larger.
- Verbose system prompts: a 2k-token instruction block is billed again on every single request.
Cache what's stable, summarise what's old, and ask for terser outputs.
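The conversation-drift arithmetic is easy to check with a few lines. This sketch assumes each turn adds a fixed number of tokens to the history and that the full history is re-sent on every call; the prices passed to `cost_usd` are made-up placeholders, not any provider's real rates.

```python
def input_tokens_on_call(turn: int, tokens_per_turn: int, system_tokens: int = 0) -> int:
    """Input tokens sent on call number `turn` (1-indexed) with full history re-sent."""
    return system_tokens + turn * tokens_per_turn


def cumulative_input_tokens(turns: int, tokens_per_turn: int, system_tokens: int = 0) -> int:
    """Total input tokens billed across all calls of the conversation."""
    return sum(
        input_tokens_on_call(t, tokens_per_turn, system_tokens)
        for t in range(1, turns + 1)
    )


def cost_usd(input_tokens: int, output_tokens: int,
             input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost given separate per-million-token prices (hypothetical values)."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


last_call = input_tokens_on_call(50, 500)                                  # 25,000
total = cumulative_input_tokens(50, 500)                                   # 637,500
total_with_system = cumulative_input_tokens(50, 500, system_tokens=2_000)  # 737,500
print(last_call, total, total_with_system)

# Placeholder prices: $1 per million input tokens, $4 per million output tokens.
print(cost_usd(total, output_tokens=50 * 250,
               input_price_per_m=1.0, output_price_per_m=4.0))
```

The per-call size grows linearly with the number of turns, so the cumulative input grows roughly with the square of the conversation length. That is why trimming or summarising old turns pays off quickly.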
Practical levers
- Trim system prompts: every paragraph costs you on every call.
- Prompt caching: some providers bill a reused (cached) prompt prefix at a discount.
- Output caps: set `max_tokens` so a runaway response can't blow up your bill.
- Streaming: does not reduce token counts by itself, but lets you cut a response off early when the answer is good enough.
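The streaming lever can be sketched without any real API. Here `fake_token_stream` is a stand-in for a provider's streaming response, and `read_until` is a hypothetical helper that stops reading once a caller-supplied check passes or a chunk cap is reached. Whether stopping early also stops billing depends on the provider, so treat that as an assumption to verify.

```python
from typing import Callable, Iterable, Iterator


def fake_token_stream() -> Iterator[str]:
    """Stand-in for a provider's streaming response; yields text chunks."""
    chunks = ["The answer ", "is 42.", " Additionally, ", "here is a long ", "digression..."]
    for chunk in chunks:
        yield chunk


def read_until(stream: Iterable[str],
               is_good_enough: Callable[[str], bool],
               max_chunks: int) -> str:
    """Accumulate chunks, stopping once the check passes or the cap is hit."""
    text = ""
    for i, chunk in enumerate(stream, start=1):
        text += chunk
        if is_good_enough(text) or i >= max_chunks:
            break  # stop consuming; a real client would also close the connection
    return text


result = read_until(fake_token_stream(), lambda t: "42" in t, max_chunks=4)
print(result)
```

The chunk cap plays the same role on the client side that `max_tokens` plays on the server side: it bounds the worst case even when the check never passes.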