API Rate Limits, RPM, and TPM Explained

Two budgets you cannot ignore: requests per minute and tokens per minute. Understanding both, and the burst behaviour around them, keeps production stable.

Apr 29, 2026 · 4 min read

The highway analogy

A highway has two limits: the number of cars that can enter per minute (a metering ramp) and the total weight of vehicles per minute (a bridge load limit). A truck convoy might pass the car count but blow the weight ceiling. A rush of tiny scooters might pass the weight check but overwhelm the metering ramp.

LLM APIs work the same way. RPM (requests per minute) is the car count. TPM (tokens per minute) is the weight. You hit whichever runs out first. Both matter; both fail differently.

The two main limits

RPM — Requests per minute. Hard cap on how many API calls you can fire. Rapid-fire small calls hit this first.
TPM — Tokens per minute. Aggregate input + output (sometimes input only, depending on provider). One enormous call hits this first.
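
To see which limit binds first for a given workload, a quick calculation helps. The sketch below is illustrative only: the tier values (500 RPM, 200,000 TPM) and the function name `binding_limit` are hypothetical, and it assumes every request uses roughly the same number of tokens.

```python
def binding_limit(rpm_limit: int, tpm_limit: int, tokens_per_request: int) -> tuple[str, int]:
    """Return which limit binds first and the sustainable requests per minute."""
    # Requests per minute that the token budget alone would allow.
    by_tokens = tpm_limit // tokens_per_request
    if by_tokens < rpm_limit:
        return "TPM", by_tokens
    return "RPM", rpm_limit


# Hypothetical tier: 500 RPM and 200,000 TPM.
print(binding_limit(500, 200_000, 150))    # small calls  -> ('RPM', 500)
print(binding_limit(500, 200_000, 8_000))  # large calls  -> ('TPM', 25)
```

With small calls the request count runs out first; with large calls the token budget caps you at 25 requests per minute even though the RPM limit is 500.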

Some providers add:

Concurrent requests — how many calls can be in-flight at once.
Tokens-per-day — daily ceiling, often on cheaper or trial tiers.
Per-model limits — Opus may have a tighter RPM than Haiku in the same account.

How rate limiting actually works

Most providers use a token bucket under the hood:

A bucket starts full.
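Each request drains the bucket in proportion to its cost: one unit from the request bucket, and its token count from the token bucket.
The bucket refills at a constant rate set by your limit (RPM for requests, TPM for tokens), up to its capacity.
If a request finds too little left in the bucket, you get 429 Too Many Requests.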

Two important consequences:

Burst tolerance — you can briefly exceed your average rate if the bucket has been filling. Useful for spiky workloads.
Sustained breach is impossible — over any long window, you can't exceed the refill rate. Plan for that.
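
The mechanism above can be sketched in a few lines. This is a generic token bucket, not any provider's actual implementation; the class name `TokenBucket` and its parameters are made up for illustration, and the injectable `clock` exists only to make the behaviour easy to test.

```python
import time


class TokenBucket:
    """Minimal token bucket: starts full, refills continuously, never exceeds capacity."""

    def __init__(self, capacity: float, refill_per_second: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self.level = capacity  # the bucket starts full
        self.updated = clock()

    def _refill(self) -> None:
        # Add whatever accumulated since the last check, capped at capacity.
        now = self.clock()
        elapsed = now - self.updated
        self.level = min(self.capacity, self.level + elapsed * self.refill_per_second)
        self.updated = now

    def try_consume(self, cost: float = 1.0) -> bool:
        """Drain `cost` units if available; False is the analogue of a 429."""
        self._refill()
        if cost <= self.level:
            self.level -= cost
            return True
        return False


# A 60 RPM limit modelled as a bucket of 60 that refills one unit per second.
requests_bucket = TokenBucket(capacity=60, refill_per_second=1.0)
print(requests_bucket.try_consume())  # True while the bucket has headroom
```

Both consequences fall out of the code: a full bucket lets a burst of up to `capacity` through at once, while over a long window throughput cannot exceed `refill_per_second`.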

Reading rate-limit headers

Production APIs return headers like:

x-ratelimit-limit-requests
x-ratelimit-remaining-requests
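x-ratelimit-reset-requests (time until the bucket refills; the format varies by provider, for example plain seconds, a duration string such as 6m0s, or a timestamp)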
Same trio for tokens.

Treat these as the truth. Don't guess from a clock; respect what the server tells you.
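
A minimal sketch of acting on those headers follows. Several things here are assumptions rather than any provider's documented contract: the token header names (`x-ratelimit-remaining-tokens`, `x-ratelimit-reset-tokens`) are inferred from the "same trio" pattern, the reset value is assumed to be either a bare number of seconds or a duration string, and the headers are assumed to arrive as a plain dict with lowercase keys. The helper names `parse_reset` and `seconds_to_wait` are hypothetical.

```python
import re

_DURATION_PART = re.compile(r"(\d+(?:\.\d+)?)(ms|s|m|h)")
_UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}


def parse_reset(value: str) -> float:
    """Convert a reset header value to seconds.

    Accepts a bare number ("12", "0.5") or a duration string ("1s", "6m0s", "250ms").
    Both formats are assumptions; check your provider's documentation.
    """
    value = value.strip()
    try:
        return float(value)
    except ValueError:
        pass
    parts = _DURATION_PART.findall(value)
    # Reject anything the pattern does not fully account for.
    if not parts or "".join(n + u for n, u in parts) != value:
        raise ValueError(f"unrecognised reset value: {value!r}")
    return sum(float(n) * _UNIT_SECONDS[u] for n, u in parts)


def seconds_to_wait(headers: dict[str, str], needed_tokens: int) -> float:
    """Return how long to pause before the next call, or 0.0 if there is headroom."""
    wait = 0.0
    if int(headers.get("x-ratelimit-remaining-requests", "1")) < 1:
        wait = max(wait, parse_reset(headers.get("x-ratelimit-reset-requests", "0")))
    if int(headers.get("x-ratelimit-remaining-tokens", str(needed_tokens))) < needed_tokens:
        wait = max(wait, parse_reset(headers.get("x-ratelimit-reset-tokens", "0")))
    return wait
```

Calling `seconds_to_wait(response_headers, estimated_tokens)` after each response gives the client a server-derived pause instead of a guess from its own clock.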

Backoff that doesn't make things worse

When you hit a 429:

Exponential backoff with jitter. Wait min(cap, base · 2^attempt) plus a random fraction. Without jitter, all your clients retry in lockstep and the next slot is also overloaded.
Honour the retry-after header when present. The server knows when its bucket refills.
Cap the retries. A request stuck retrying for 60 seconds usually should fail upward, not keep trying.
Distinguish 429 from 503. 429 is "you're over quota." 503 is "we're overloaded." Both call for backoff, but the right alarms differ.
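
The first three rules can be combined in one small retry helper. This is a sketch under stated assumptions: `RateLimited` is a hypothetical exception that your own HTTP wrapper would raise on a 429 (carrying the retry-after value in seconds when the server sent one), and "a random fraction" is interpreted here as adding between 0% and 100% of the computed delay. The `sleep` and `rng` parameters are injectable only to keep the example testable.

```python
import random
import time
from typing import Optional


class RateLimited(Exception):
    """Hypothetical error raised by the caller's own HTTP wrapper on a 429."""

    def __init__(self, retry_after: Optional[float] = None):
        super().__init__("429 Too Many Requests")
        self.retry_after = retry_after


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 20.0, rng=random.random) -> float:
    """min(cap, base * 2**attempt) plus a random fraction of that value."""
    ceiling = min(cap, base * 2 ** attempt)
    return ceiling + rng() * ceiling


def call_with_retries(send, max_attempts: int = 5, sleep=time.sleep):
    """Call `send()`; on RateLimited, wait and retry up to `max_attempts` times."""
    for attempt in range(max_attempts):
        try:
            return send()
        except RateLimited as err:
            if attempt == max_attempts - 1:
                raise  # fail upward instead of retrying forever
            # Prefer the server's hint; otherwise fall back to jittered backoff.
            delay = err.retry_after if err.retry_after is not None else backoff_delay(attempt)
            sleep(delay)
```

Capping by attempt count is one choice; a wall-clock deadline (for example, give up after 60 seconds in total) works equally well and maps more directly onto user-facing timeouts.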

Architectural moves that prevent 429s

Concurrency cap on the client side. A semaphore limiting in-flight calls prevents thundering herds.
Queue with controlled drain. A worker pool draws from a queue at a rate just under your TPM ceiling. Smooth instead of spiky.
Token-aware throttling. Estimate input tokens before sending; reserve TPM in your queue accordingly.
Separate accounts for prod vs dev. A dev typing wildly should never share quota with paying customers.
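Multiple keys / regions. For very large workloads, spreading load across multiple keys or geographic endpoints can add headroom, but only where the provider meters limits per key or per region. Many apply limits per account or organization, so extra keys under the same account share one quota; check the provider's terms before relying on this.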
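
The client-side concurrency cap from the first item can be sketched with `asyncio.Semaphore`. The helper name `run_capped` and the `fake_call` stand-in are hypothetical; a real client would replace `fake_call` with its actual API request. Token-aware reservation is not shown here.

```python
import asyncio


async def run_capped(jobs, max_in_flight: int):
    """Run coroutine factories with at most `max_in_flight` executing at once."""
    gate = asyncio.Semaphore(max_in_flight)

    async def guarded(job):
        async with gate:  # waits here when the cap is reached
            return await job()

    return await asyncio.gather(*(guarded(job) for job in jobs))


async def demo():
    in_flight = 0
    peak = 0

    async def fake_call():
        # Stand-in for a real API request; tracks how many run concurrently.
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)
        in_flight -= 1
        return "done"

    results = await run_capped([fake_call] * 10, max_in_flight=3)
    return peak, len(results)


if __name__ == "__main__":
    print(asyncio.run(demo()))  # (3, 10): ten calls completed, never more than three at once
```

A semaphore bounds concurrency, not rate, so it pairs naturally with a queue or bucket that paces how quickly new work is admitted.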

Tier upgrades

Most providers expose tiered limits that grow with your usage history and payment volume. New accounts start tight; sustained spend earns higher tiers automatically (or via support request). Plan for it during launch.

Common 429 root causes

One slow consumer holding many concurrent requests open. A streaming request occupies a concurrency slot until the stream finishes, which can be many seconds.
Bursty front-end behaviour — a button users tap repeatedly when the app feels slow.
Missing client-side throttling during retries — the retry storm itself becomes the load.
Token estimation off — sending much longer prompts than expected drains TPM faster.

In one line

RPM is your call count, TPM is your token budget — and the difference between a healthy AI app and a flaky one is usually somewhere in their queueing, retry, and backoff code.

From the field

Rate limits aren't an edge case to handle later — they're a certainty to design for on day one, because the day your feature gets popular is the day you start hitting 429s. Exponential backoff with jitter is table stakes (without jitter, all your retries collide and make it worse), but the deeper fix is needing fewer calls: cache aggressively, batch where the API allows, and don't fire a request you could have avoided. I put a queue in front of any high-volume path so bursts smooth out instead of slamming the limit. Treat capacity as a budget you schedule, not a wall you discover in production.
