LLM Routing: Right Model for Right Task, With Fallbacks

A router classifies each call and sends it to the cheapest model that can handle it. Add fallbacks for outages and you get a system that is cheaper *and* more reliable than a single-model setup.

Apr 29, 2026 · 4 min read

The hospital-triage analogy

A hospital does not send every patient to the surgeon. A nurse triages: minor cuts go to the clinic, broken bones to the ER, complex cases to the specialist. The surgeon's time is not wasted on bandages. Patients are seen faster. Cost goes down; outcomes go up.

LLM routing is hospital triage. A small classifier (or rule set) picks the right model for each call: cheap models handle the easy bulk of traffic (say 80%), mid-tier models handle most of what remains, and the expensive models are reserved for the genuinely hard 5% or so. Add fallbacks for outages and the system becomes cheaper and more reliable than any single-model setup.

What a router decides

For each incoming call, a router picks:

Which model (Haiku, Sonnet, Opus; or across providers).
Which provider (Anthropic, OpenAI, self-hosted, etc.) — for redundancy and price arbitrage.
Which region — latency-aware.
Which inference strategy: single-shot, self-consistency, or reflexion, with cost growing as the task gets harder.

The router itself is usually a small, fast model (Haiku) or a classifier trained on past traffic.

Three routing strategies

1. Rule-based

Hand-written rules: input length > X → bigger model; certain keywords → specialised model; specific feature → cheap model.

✅ Predictable, debuggable, no extra inference cost.
❌ Misses non-obvious patterns; rules rot as traffic shifts.
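
As a rough illustration of the three kinds of rule above, here is a minimal sketch. The tier names, the length threshold, the keyword list, and the feature names are all hypothetical placeholders, not values from any real deployment.

```python
from dataclasses import dataclass

# Hypothetical tiers and thresholds; tune them against your own traffic.
CHEAP, BIG, SPECIALISED = "cheap", "big", "specialised"
LONG_INPUT_CHARS = 4000
SPECIALIST_KEYWORDS = ("contract", "sql", "diagnosis")
CHEAP_FEATURES = {"autocomplete", "tagging"}


@dataclass(frozen=True)
class Call:
    feature: str  # product feature that issued the call
    prompt: str


def route_by_rules(call: Call) -> tuple[str, str]:
    """Return (tier, reason). The first matching rule wins."""
    text = call.prompt.lower()
    if call.feature in CHEAP_FEATURES:
        return CHEAP, f"feature:{call.feature}"
    if any(word in text for word in SPECIALIST_KEYWORDS):
        return SPECIALISED, "keyword"
    if len(call.prompt) > LONG_INPUT_CHARS:
        return BIG, "long_input"
    return CHEAP, "default"
```

Returning the reason alongside the tier keeps the rules debuggable: every decision can be traced to the rule that fired.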

2. Learned classifier

A small model (Haiku, a fine-tuned small LLM, or even a sklearn classifier on embeddings) predicts which tier will handle the call best.

✅ Adapts to real traffic; catches subtler patterns.
❌ Adds a call (cheap, but real); needs training data.
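
To make the idea concrete without extra dependencies, the sketch below uses a nearest-centroid classifier over bag-of-words counts. This is a deliberately simple stand-in, not the sklearn-on-embeddings setup mentioned above: in practice `featurize` would be replaced by a real embedding model, and the tier labels would come from logged traffic. All names here are hypothetical.

```python
import math
from collections import Counter, defaultdict


def featurize(text: str) -> Counter:
    """Bag-of-words counts; a stand-in for a real embedding."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class CentroidRouter:
    """Nearest-centroid classifier: one summed feature vector per tier."""

    def __init__(self) -> None:
        self.centroids: dict[str, Counter] = defaultdict(Counter)

    def fit(self, examples: list[tuple[str, str]]) -> "CentroidRouter":
        # examples are (prompt, tier) pairs, e.g. taken from routing logs
        for prompt, tier in examples:
            self.centroids[tier].update(featurize(prompt))
        return self

    def predict(self, prompt: str, default: str = "mid") -> str:
        features = featurize(prompt)
        scored = {tier: cosine(features, c) for tier, c in self.centroids.items()}
        best = max(scored, key=scored.get, default=default)
        # No overlap with any tier's examples: fall back to the default tier.
        return best if scored.get(best, 0.0) > 0 else default
```

Cosine similarity ignores vector length, so summing the examples per tier behaves the same as averaging them.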

3. Confidence-based escalation

Try the cheap model first. If the output looks low-confidence (too short, hedging, schema-invalid, or a refusal), escalate to a stronger model.

✅ Pays only when needed; great for long-tail hard queries.
❌ Higher latency on escalations; needs a reliable confidence signal.
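
A minimal sketch of the escalation loop, assuming each tier is exposed as a plain function from prompt to text. The confidence check is a crude heuristic built from the signals listed above; the marker phrases and the length threshold are hypothetical and would need calibration on real outputs.

```python
import json
from typing import Callable

# Hypothetical heuristics; calibrate against real model outputs.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm not sure", "as an ai")
MIN_CHARS = 20


def looks_confident(output: str, require_json: bool = False) -> bool:
    text = output.strip()
    if any(marker in text.lower() for marker in REFUSAL_MARKERS):
        return False  # refusal or hedging
    if require_json:
        try:
            json.loads(text)  # schema check reduced to "parses as JSON"
        except json.JSONDecodeError:
            return False
        return True
    return len(text) >= MIN_CHARS  # suspiciously short free-text answer


def answer_with_escalation(
    prompt: str,
    tiers: list[tuple[str, Callable[[str], str]]],
    require_json: bool = False,
) -> tuple[str, str]:
    """Try tiers cheapest-first and return (tier_name, output).

    Stops at the first confident answer; if none is confident, returns
    the output of the last (strongest) tier.
    """
    if not tiers:
        raise ValueError("at least one tier is required")
    for name, call_model in tiers:
        output = call_model(prompt)
        if looks_confident(output, require_json):
            break
    return name, output
```

Each escalation adds a full extra model call, which is where the latency cost comes from.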

The best setups usually mix all three: cheap rules peel off the obvious cases, a small classifier handles the middle, and confidence-based escalation catches the rest.

Fallbacks: the reliability half

Routing for cost is half the picture. Fallbacks for outages are the other:

Provider returns 503 / 429 / weird timeouts → fall back to a secondary provider.
Self-hosted model is cold or down → spill to hosted.
Specific model deprecated mid-flight → automatic redirect to its replacement.

Without fallbacks, every provider hiccup is a customer-facing incident. With fallbacks, most users never notice.
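
A fallback chain can be sketched as an ordered list of providers tried in turn. The example assumes each provider client is wrapped in a function that raises a hypothetical `ProviderError` carrying the HTTP status, or the built-in `TimeoutError`; real SDKs have their own exception types that would need mapping.

```python
from typing import Callable


class ProviderError(Exception):
    """Hypothetical wrapper for a provider's HTTP error response."""

    def __init__(self, status: int):
        super().__init__(f"provider returned {status}")
        self.status = status


# Statuses worth failing over on; extend the set to match your providers.
FAILOVER_STATUSES = {429, 503}


def call_with_fallback(
    prompt: str, providers: list[tuple[str, Callable[[str], str]]]
) -> tuple[str, str]:
    """Try providers in order and return (provider_name, output)."""
    failures = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except TimeoutError:
            failures.append((name, "timeout"))
        except ProviderError as exc:
            if exc.status not in FAILOVER_STATUSES:
                raise  # e.g. a 400 is likely our bug; failing over will not fix it
            failures.append((name, exc.status))
    raise RuntimeError(f"all providers failed: {failures}")
```

Only errors that plausibly reflect the provider's health trigger a failover; everything else surfaces immediately so it is not masked by the chain.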

Engineering the router

Treat it as a service. It will be called on every request — make it fast (Haiku-cheap or in-process classifier).
Log every decision: which tier was chosen, why, whether the call succeeded, and whether confidence was low. The log becomes training data and debugging gold.
A/B test changes. Routing changes have direct cost and quality consequences; ship them like product changes, not config tweaks.
Build budget gates. A user / feature with a budget cap routes only to cheap tiers regardless of difficulty.
Health-check upstream models. A fast probe per provider per minute lets the router know who's healthy before sending real traffic.
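
Two of the points above, budget gates and decision logging, fit in a few lines. This is a sketch under assumptions: the tier names are hypothetical, and the cap is modelled as a per-tenant maximum tier, which is only one reading of "budget cap".

```python
import json
from dataclasses import asdict, dataclass

# Hypothetical tier names, cheapest first.
TIER_ORDER = ["cheap", "mid", "strong"]


@dataclass(frozen=True)
class Decision:
    tenant: str
    requested_tier: str
    chosen_tier: str
    reason: str


def apply_budget_gate(
    tenant: str, requested_tier: str, max_tier: dict[str, str]
) -> Decision:
    """Clamp the requested tier to the tenant's cap, if it has one."""
    cap = max_tier.get(tenant)
    if cap is not None and TIER_ORDER.index(requested_tier) > TIER_ORDER.index(cap):
        return Decision(tenant, requested_tier, cap, "budget_cap")
    return Decision(tenant, requested_tier, requested_tier, "as_requested")


def log_line(decision: Decision, succeeded: bool, low_confidence: bool) -> str:
    """One JSON line per decision, usable for debugging or later training."""
    record = {**asdict(decision), "succeeded": succeeded, "low_confidence": low_confidence}
    return json.dumps(record, sort_keys=True)
```

Logging both the requested and the chosen tier shows how often the gate overrides the router, which is useful when deciding whether a cap is too tight.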

What "right model" actually means

Build a small eval that maps task class → cheapest passing model. For each class:

Define the eval set.
Run it against Haiku, Sonnet, Opus.
Pick the smallest model whose score clears your bar.
Encode that mapping in the router.
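
The four steps above can be sketched as a small function that turns eval scores into a routing table. It assumes scores are in [0, 1] and that models are listed cheapest first; the scores in the usage note are made up purely for illustration.

```python
from typing import Optional


def cheapest_passing_model(
    scores: dict[str, float], models_by_cost: list[str], bar: float
) -> Optional[str]:
    """Return the first model, cheapest first, whose score clears the bar."""
    for model in models_by_cost:
        if scores.get(model, 0.0) >= bar:
            return model
    return None  # nothing passes: tighten the task or revisit the bar


def build_routing_table(
    eval_results: dict[str, dict[str, float]],
    models_by_cost: list[str],
    bar: float,
) -> dict[str, Optional[str]]:
    """Map each task class to its cheapest passing model."""
    return {
        task: cheapest_passing_model(scores, models_by_cost, bar)
        for task, scores in eval_results.items()
    }
```

With illustrative scores such as `{"extraction": {"haiku": 0.96, "sonnet": 0.98, "opus": 0.99}, "planning": {"haiku": 0.71, "sonnet": 0.93, "opus": 0.97}}` and a bar of 0.9, the table would send extraction to `haiku` and planning to `sonnet`.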

This is unglamorous and powerful. Most teams skip it and over-pay forever.

Common routing pitfalls

No measurement. Routing decisions are made on gut feel. Quality drifts; nobody notices.
Routing only on input length. A 50-word query can be a casual one or a hard logic puzzle. Length is a weak signal alone.
Ignoring downstream cost. Routing to a cheap model that hallucinates and triggers 3 retries can cost more than calling the right model once.
No model-version pinning. The router decides "Sonnet", but if the provider changes what that name points to, quality moves under your feet. Pin to dated versions; upgrade deliberately.
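
The downstream-cost pitfall comes down to simple arithmetic. The sketch below assumes retries go to the same model and succeed independently with a fixed probability, so the expected number of attempts is 1 / success rate. The prices are hypothetical units, not real price-list figures.

```python
def cost_per_success(cost_per_call: float, success_rate: float) -> float:
    """Expected spend to get one usable answer, retrying until success."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_call / success_rate


# Hypothetical numbers: the cheap model costs 1 unit per call but only
# succeeds a quarter of the time (three retries on average); the right
# model costs 3 units and succeeds first try.
cheap = cost_per_success(1.0, 0.25)
right = cost_per_success(3.0, 1.0)
```

Under these assumptions the cheap path costs 4 units per usable answer against 3 for the stronger model, before counting the added latency of the retries.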

Where routing wins biggest

Workloads with a wide difficulty distribution. Most calls easy, some hard.
Multi-tenant SaaS — different tiers get different routers.
Public chatbots — huge volume of trivial questions, occasional novel ones.
Agent toolchains — different steps need different muscle (planning vs extraction).

When NOT to bother

Monotask workloads where every call is roughly equally hard.
Tiny volumes where the engineering cost exceeds the savings.
Quality-critical paths where the cheapest passing model is also the only acceptable one. Skip the router; just use that model.

In one line

A router and a fallback chain are the difference between "we have an AI feature" and "we have an AI feature that's cheap, fast, and stays up." Build both; review the routing decisions like you review code.

From the field

Routing pays off two ways and most teams build only the first. The cost half — send the bulk of easy traffic to a cheap model, escalate the hard cases — is the obvious win. The reliability half is the one people skip until an outage: a fallback to a second provider so one vendor's bad afternoon isn't your downtime. My other rule is to keep the router dumb — a few clear rules or a tiny classifier — because a clever router that itself calls a big model to decide has just added cost and a new thing to debug. Route on simple signals, always keep a fallback, and check the cheap path isn't quietly tanking quality.
