The hospital-triage analogy
A hospital does not send every patient to the surgeon. A nurse triages: minor cuts go to the clinic, broken bones to the ER, complex cases to the specialist. The surgeon's time is not wasted on bandages. Patients are seen faster. Cost goes down; outcomes go up.
LLM routing is hospital triage. A small classifier (or rule set) picks the right model for each call: cheap models handle the easy 80%, mid models handle most of the rest, and the expensive models are reserved for the genuinely hard 5%. Add fallbacks for outages and your system becomes cheaper and more reliable than sending every call to a single top-tier model.
What a router decides
For each incoming call, a router picks:
- Which model (Haiku, Sonnet, or Opus, or the equivalent tiers at another provider).
- Which provider (Anthropic, OpenAI, self-hosted, etc.) — for redundancy and price arbitrage.
- Which region — latency-aware.
- Which inference strategy: single-shot, self-consistency, or reflexion, with cost growing as the task gets harder.
The router itself is usually a small fast model (Haiku), a rule set, or a classifier trained on past traffic.
Three routing strategies
1. Rule-based
Hand-written rules: input length > X → bigger model; certain keywords → specialised model; specific feature → cheap model.
- ✅ Predictable, debuggable, no extra inference cost.
- ❌ Misses non-obvious patterns; rules rot as traffic shifts.
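A rule-based router can be only a few lines long. The sketch below is illustrative, not a reference implementation: the tier names, the length threshold, the keyword list, and the feature names are all hypothetical placeholders that you would replace with values taken from your own traffic.

```python
from dataclasses import dataclass

# Hypothetical tier names and thresholds, chosen only for illustration.
CHEAP, LARGE, SPECIALISED = "cheap", "large", "specialised"
LONG_INPUT_CHARS = 4000                      # the "X" in "input length > X"
SPECIALIST_KEYWORDS = ("contract", "diagnosis")
CHEAP_FEATURES = {"autocomplete", "tagging"}


@dataclass
class Request:
    text: str
    feature: str = "chat"


def route_by_rules(req: Request) -> str:
    """Return a tier name; rules are checked in order, first match wins."""
    if req.feature in CHEAP_FEATURES:        # specific feature -> cheap model
        return CHEAP
    lowered = req.text.lower()
    if any(word in lowered for word in SPECIALIST_KEYWORDS):
        return SPECIALISED                   # certain keywords -> specialised model
    if len(req.text) > LONG_INPUT_CHARS:     # long input -> bigger model
        return LARGE
    return CHEAP                             # default: cheapest tier
```

Because the rules are plain data plus one function, they are easy to unit-test and diff, which is most of what makes this strategy debuggable.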
2. Learned classifier
A small model (Haiku, a fine-tuned small LLM, or even a sklearn classifier on embeddings) predicts which tier handles this best.
- ✅ Adapts to real traffic; catches subtler patterns.
- ❌ Adds a call (cheap, but real); needs training data.
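To make the learned-classifier idea concrete, here is a deliberately tiny sketch. It uses a nearest-centroid classifier in pure Python as a stand-in for whatever you would really train (for example a scikit-learn classifier on real embeddings). The `toy_embed` function, the keyword set, and the labelled examples are all invented for illustration; in practice the labels would come from logged routing outcomes.

```python
import math

# Hypothetical labelled traffic: (prompt, tier that handled it well).
TRAINING = [
    ("what is the capital of France", "cheap"),
    ("translate hello to Spanish", "cheap"),
    ("prove that the sum of two odd numbers is even and derive the general rule", "strong"),
    ("why does this recursion overflow, derive the bound", "strong"),
]


def toy_embed(text):
    """Stand-in for a real embedding model: two crude text features."""
    words = text.lower().split()
    n = max(len(words), 1)
    hard_words = {"why", "prove", "derive"}
    return [
        min(len(words) / 50.0, 1.0),                              # relative length
        sum(w.strip(",.?!") in hard_words for w in words) / n,   # share of "hard" words
    ]


def train_centroids(examples):
    """Average the embeddings of each tier's examples into one centroid."""
    grouped = {}
    for text, tier in examples:
        grouped.setdefault(tier, []).append(toy_embed(text))
    return {
        tier: [sum(col) / len(vecs) for col in zip(*vecs)]
        for tier, vecs in grouped.items()
    }


def predict_tier(text, centroids):
    """Pick the tier whose centroid is closest to the prompt's embedding."""
    vec = toy_embed(text)
    return min(centroids, key=lambda tier: math.dist(vec, centroids[tier]))


centroids = train_centroids(TRAINING)
```

With these toy examples, `predict_tier("prove the lemma", centroids)` lands on the strong tier and `predict_tier("what is two plus two", centroids)` on the cheap one. The shape (embed, fit on past traffic, predict a tier) carries over to a real classifier; the features do not.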
3. Confidence-based escalation
Try the cheap model first. If the output looks low-confidence (too short, hedging, schema-invalid, or a refusal), escalate to a stronger model.
- ✅ Pays only when needed; great for long-tail hard queries.
- ❌ Higher latency on escalations; needs a reliable confidence signal.
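A minimal sketch of confidence-based escalation follows. The confidence check is a set of crude heuristics mirroring the signals listed above (short, hesitant, schema-invalid, refusal); the marker phrases and the length cutoff are assumptions, and `cheap_model` / `strong_model` are hypothetical callables standing in for real API clients.

```python
import json

# Hypothetical heuristics; tune these against your own logged outputs.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable")
HEDGE_MARKERS = ("i'm not sure", "it might be", "possibly")
MIN_CHARS = 20


def looks_low_confidence(output, require_json=False):
    """Return True if the output should be escalated to a stronger model."""
    if require_json:
        # Schema check reduced to "does it parse as JSON at all".
        try:
            json.loads(output)
        except json.JSONDecodeError:
            return True
        return False
    text = output.strip().lower()
    if len(text) < MIN_CHARS:                                    # too short
        return True
    return any(m in text for m in REFUSAL_MARKERS + HEDGE_MARKERS)


def answer_with_escalation(prompt, cheap_model, strong_model, require_json=False):
    """Try the cheap model first; pay for the strong one only when needed."""
    first = cheap_model(prompt)
    if not looks_low_confidence(first, require_json):
        return first, "cheap"
    return strong_model(prompt), "strong"
```

The function returns the tier alongside the answer so the caller can log how often escalation fires; a rising escalation rate is a hint that the cheap tier is being asked to do too much.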
The best setups usually mix all three: cheap rules peel off the obvious cases, a small classifier handles the middle, and confidence-based escalation catches whatever slips through.
Fallbacks: the reliability half
Routing for cost is half the picture. Fallbacks for outages are the other:
- Provider returns 503 / 429 / weird timeouts → fall back to a secondary provider.
- Self-hosted model is cold or down → spill to hosted.
- Specific model deprecated mid-flight → automatic redirect to its replacement.
Without fallbacks, every provider hiccup is a customer-facing incident. With fallbacks, most users never notice.
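A fallback chain can be sketched as an ordered list of providers tried in turn. Everything named here is hypothetical: `ProviderError` stands in for whatever exception your client library raises, and each provider is just a callable. Only transient failures (503, 429, timeouts) trigger a fallback; other errors are re-raised, since a bad request will likely fail on the secondary provider too.

```python
class ProviderError(Exception):
    """Hypothetical error type carrying the upstream HTTP status."""

    def __init__(self, status):
        super().__init__(f"provider returned {status}")
        self.status = status


RETRYABLE_STATUSES = {429, 503}


def call_with_fallback(prompt, providers):
    """Try (name, callable) pairs in order; return (name, result) of the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderError as exc:
            if exc.status not in RETRYABLE_STATUSES:
                raise                      # not an outage: do not mask it
            errors.append((name, str(exc)))
        except TimeoutError as exc:
            errors.append((name, f"timeout: {exc}"))
    raise RuntimeError(f"all providers failed: {errors}")
```

The same loop covers the self-hosted case: put the self-hosted model first and the hosted one second, and a cold or down instance spills over automatically.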
Engineering the router
- Treat it as a service. It will be called on every request, so make it fast (a Haiku-class model or an in-process classifier).
- Log every decision: which tier was chosen, why, whether the call succeeded, and whether confidence was low. These logs become training data and debugging gold.
- A/B test changes. Routing changes have direct cost and quality consequences; ship them like product changes, not config tweaks.
- Build budget gates. A user / feature with a budget cap routes only to cheap tiers regardless of difficulty.
- Health-check upstream models. A fast probe per provider per minute lets the router know who's healthy before sending real traffic.
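Two of the points above, budget gates and decision logging, fit in one short sketch. It assumes a budget gate is expressed as a maximum allowed tier per user or feature, which is one reading of "routes only to cheap tiers"; the tier names, the `Decision` fields, and the list used as a log sink are all illustrative.

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import Optional

TIER_ORDER = ["cheap", "mid", "strong"]  # hypothetical tier names, cheapest first


@dataclass
class Decision:
    request_id: str
    proposed_tier: str   # what the router wanted
    tier: str            # what was actually used after the budget gate
    gated: bool          # True if the budget gate downgraded the call
    timestamp: float


def apply_budget_gate(proposed_tier: str, max_tier: Optional[str] = None) -> str:
    """Clamp the proposed tier to the caller's budget cap, if one is set."""
    if max_tier is None:
        return proposed_tier
    capped = min(TIER_ORDER.index(proposed_tier), TIER_ORDER.index(max_tier))
    return TIER_ORDER[capped]


def decide_and_log(request_id, proposed_tier, max_tier, sink):
    """Apply the gate and append one JSON line describing the decision to sink."""
    tier = apply_budget_gate(proposed_tier, max_tier)
    decision = Decision(request_id, proposed_tier, tier, tier != proposed_tier, time.time())
    sink.append(json.dumps(asdict(decision)))
    return decision
```

Logging both the proposed and the final tier matters: it lets you see later how often the budget gate overrode the router, and whether quality suffered when it did.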
What "right model" actually means
Build a small eval that maps task class → cheapest passing model. For each class:
- Define the eval set.
- Run it against Haiku, Sonnet, Opus.
- Pick the smallest model whose score clears your bar.
- Encode that mapping in the router.
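The four steps above reduce to a small lookup once the eval scores exist. In this sketch the model order and function names are assumptions, and the scores are supplied by the caller; nothing here runs an actual eval.

```python
MODELS_BY_COST = ["haiku", "sonnet", "opus"]  # cheapest first


def cheapest_passing_model(scores, bar):
    """Return the cheapest model whose eval score clears the bar, or None."""
    for model in MODELS_BY_COST:
        if scores.get(model, 0.0) >= bar:
            return model
    return None


def build_routing_table(eval_scores, bars):
    """Map each task class to its cheapest passing model."""
    return {
        task: cheapest_passing_model(scores, bars[task])
        for task, scores in eval_scores.items()
    }
```

As a usage example with made-up numbers: if an "extraction" class scores 0.96 / 0.98 / 0.99 on Haiku / Sonnet / Opus against a bar of 0.95, it maps to "haiku"; if a "planning" class scores 0.61 / 0.83 / 0.93 against a bar of 0.90, it maps to "opus". A `None` result means no model clears the bar, which is a signal to fix the prompt or the bar rather than the router.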
This is unglamorous and powerful. Most teams skip it and over-pay forever.
Common routing pitfalls
- No measurement. Routing decisions are baked in from gut feel. Quality drifts; nobody notices.
- Routing only on input length. A 50-word query can be a casual one or a hard logic puzzle. Length is a weak signal alone.
- Ignoring downstream cost. Routing to a cheap model that hallucinates and triggers 3 retries can cost more than sending the call to the right model on the first try.
- No model-version pinning. The router decides "Sonnet", but an unpinned model alias can move to a new underlying version without notice, and quality moves under your feet. Pin to dated versions; upgrade deliberately.
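The downstream-cost pitfall is easy to check with arithmetic. The sketch below assumes each attempt succeeds independently with the same probability and that you retry until success, so the expected number of attempts is 1 / success_rate. The prices and success rates in the usage note are invented units, not real pricing.

```python
def expected_cost_per_success(price_per_call: float, success_rate: float) -> float:
    """Expected spend to get one good answer, retrying until success.

    Assumes independent attempts, so expected attempts = 1 / success_rate.
    """
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return price_per_call / success_rate
```

For instance, a cheap model priced at 1 unit that succeeds a quarter of the time costs 4 units per good answer in expectation, while a model priced at 3 units that succeeds 95% of the time costs about 3.16. The cheaper sticker price loses, and that is before counting the added latency of the retries.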
Where routing wins biggest
- Workloads with a wide difficulty distribution. Most calls easy, some hard.
- Multi-tenant SaaS — different tiers get different routers.
- Public chatbots — huge volume of trivial questions, occasional novel ones.
- Agent toolchains — different steps need different muscle (planning vs extraction).
When NOT to bother
- Monotask workloads where every call is roughly equally hard.
- Tiny volumes where the engineering cost exceeds the savings.
- Quality-critical paths where the cheapest passing model is also the only acceptable one. Skip the router; just use that model.
In one line
A router and a fallback chain are the difference between "we have an AI feature" and "we have an AI feature that's cheap, fast, and stays up." Build both; review the routing decisions like you review code.