API Rate Limits, RPM, and TPM Explained

Two budgets you cannot ignore: requests per minute and tokens per minute. Understanding both — and the burst behaviour around them — keeps your prod stable.

The highway analogy

A highway has two limits: the number of cars that can enter per minute (a metering ramp) and the total weight of vehicles per minute (a bridge load limit). A truck convoy might pass the car count but blow the weight ceiling. A rush of tiny scooters might pass the weight check but overwhelm the metering ramp.

LLM APIs work the same way. RPM (requests per minute) is the car count. TPM (tokens per minute) is the weight. You hit whichever runs out first. Both matter; both fail differently.

The two main limits

  • RPM — Requests per minute. Hard cap on how many API calls you can fire. Rapid-fire small calls hit this first.
  • TPM — Tokens per minute. Usually the aggregate of input + output tokens (some providers count input and output separately, or input only). One enormous call, or a few large ones, hits this first.
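
To make "you hit whichever runs out first" concrete, here is a minimal sketch that works out which limit binds for a given average request size. The function name and the 500 RPM / 100,000 TPM figures are made up for illustration; substitute your own tier's numbers.

```python
def binding_limit(rpm_limit: int, tpm_limit: int, tokens_per_request: int) -> tuple[str, float]:
    """Return which limit binds first and the sustainable requests per minute."""
    if tokens_per_request <= 0:
        raise ValueError("tokens_per_request must be positive")
    # Requests per minute that the token budget alone would allow.
    by_tokens = tpm_limit / tokens_per_request
    if by_tokens < rpm_limit:
        return "TPM", by_tokens
    return "RPM", float(rpm_limit)


# Hypothetical limits: 500 RPM and 100,000 TPM.
small_calls = binding_limit(500, 100_000, 50)     # many tiny requests
large_calls = binding_limit(500, 100_000, 4_000)  # few heavy requests
print(small_calls, large_calls)
```

With these assumed numbers, 50-token calls are capped by RPM, while 4,000-token calls run out of TPM after 25 requests per minute.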

Some providers add:

  • Concurrent requests — how many calls can be in-flight at once.
  • Tokens-per-day — daily ceiling, often on cheaper or trial tiers.
  • Per-model limits — Opus may have a tighter RPM than Haiku in the same account.

How rate limiting actually works

Most providers use a token bucket under the hood:

  • A bucket starts full.
  • Each request drains the bucket: one unit from the request bucket, and its token count from the token bucket.
  • Each bucket refills at a constant rate (set by your RPM or TPM limit).
  • If a request needs more than the bucket currently holds, you get 429 Too Many Requests.

Two important consequences:

  1. Burst tolerance — you can briefly exceed your average rate if the bucket has been filling. Useful for spiky workloads.
  2. Sustained breach is impossible — over any long window, you can't exceed the refill rate. Plan for that.
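
The bucket behaviour described above can be sketched in a few lines. This is an illustrative client-side model, not any provider's actual implementation; the class name, parameters, and the injectable clock are assumptions made so the example runs deterministically.

```python
import time


class TokenBucket:
    """Illustrative token bucket: starts full, drains per request, refills steadily."""

    def __init__(self, capacity: float, refill_per_sec: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.clock = clock
        self.level = capacity  # the bucket starts full
        self.last = clock()

    def _refill(self) -> None:
        now = self.clock()
        # Add what accrued since the last check, never exceeding capacity.
        self.level = min(self.capacity, self.level + (now - self.last) * self.refill_per_sec)
        self.last = now

    def try_acquire(self, cost: float = 1.0) -> bool:
        self._refill()
        if cost <= self.level:
            self.level -= cost
            return True
        return False  # a server would answer 429 here


# A fake clock keeps the demo deterministic.
fake_now = [0.0]
bucket = TokenBucket(capacity=10, refill_per_sec=1, clock=lambda: fake_now[0])
burst = [bucket.try_acquire() for _ in range(12)]      # 10 succeed, then the bucket is empty
fake_now[0] += 3                                       # three seconds pass
after_wait = [bucket.try_acquire() for _ in range(4)]  # only 3 units have refilled
print(burst.count(True), after_wait.count(True))
```

The burst of ten passes at once because the bucket was full; after that, throughput is bounded by the refill rate, which is the "sustained breach is impossible" point.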

Reading rate-limit headers

Production APIs return headers like:

  • x-ratelimit-limit-requests
  • x-ratelimit-remaining-requests
  • x-ratelimit-reset-requests (time until the bucket refills; the format varies by provider, for example a duration such as 1s or a timestamp)
  • Same trio for tokens.

Treat these as the truth. Don't guess from a clock; respect what the server tells you.
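
As a rough sketch of reading these headers, the helper below decides how long to pause before the next call. It assumes the header names listed above, a plain dict with lowercase keys, and reset values given either as a number of seconds or as a duration string such as 6m0s or 250ms. Check your provider's documentation for the real names and format; both function names are hypothetical.

```python
import re

_DURATION_PART = re.compile(r"(\d+(?:\.\d+)?)(ms|s|m|h)")
_UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}


def parse_reset(value: str) -> float:
    """Convert a reset value to seconds. Accepts '2', '0.5', '1s', '6m0s', '250ms'."""
    value = value.strip()
    try:
        return float(value)  # plain number of seconds
    except ValueError:
        pass
    parts = _DURATION_PART.findall(value)
    # Reject anything the pattern did not consume completely.
    if not parts or "".join(num + unit for num, unit in parts) != value:
        raise ValueError(f"unrecognised reset value: {value!r}")
    return sum(float(num) * _UNIT_SECONDS[unit] for num, unit in parts)


def seconds_to_wait(headers: dict[str, str], kind: str = "requests") -> float:
    """Return 0 if budget remains, otherwise the advertised reset delay in seconds."""
    remaining = int(headers.get(f"x-ratelimit-remaining-{kind}", "1"))
    if remaining > 0:
        return 0.0
    return parse_reset(headers.get(f"x-ratelimit-reset-{kind}", "0"))


# Example response headers (values invented for illustration).
headers = {
    "x-ratelimit-remaining-requests": "0",
    "x-ratelimit-reset-requests": "6m0s",
    "x-ratelimit-remaining-tokens": "1200",
    "x-ratelimit-reset-tokens": "250ms",
}
print(seconds_to_wait(headers, "requests"), seconds_to_wait(headers, "tokens"))
```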

Backoff that doesn't make things worse

When you hit a 429:

  • Exponential backoff with jitter. Wait min(cap, base · 2^attempt) plus a random fraction. Without jitter, all your clients retry in lockstep and the next slot is also overloaded.
  • Honour the retry-after header when present. The server knows when its bucket refills.
  • Cap the retries. A request stuck retrying for 60 seconds should usually fail upward, not keep trying.
  • Distinguish 429 from 503. 429 is "you're over quota." 503 is "we're overloaded." Both call for backoff, but the right alarms differ.
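
A minimal sketch of the retry rules above, assuming a hypothetical RateLimited exception that your HTTP layer raises on a 429 and that carries the retry-after value when the server sent one. The defaults (base 0.5 s, cap 30 s, 5 attempts) are placeholders, not recommendations.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical error for a 429, with an optional retry-after value in seconds."""

    def __init__(self, retry_after=None):
        super().__init__("429 Too Many Requests")
        self.retry_after = retry_after


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random.random):
    """min(cap, base * 2**attempt) plus a random fraction of that delay as jitter."""
    delay = min(cap, base * 2 ** attempt)
    return delay + rng() * delay  # jitter can push the total above cap


def call_with_retries(send, max_attempts=5, sleep=time.sleep):
    for attempt in range(max_attempts):
        try:
            return send()
        except RateLimited as err:
            if attempt == max_attempts - 1:
                raise  # retries exhausted: fail upward
            # Prefer the server's hint over our own estimate.
            wait = err.retry_after if err.retry_after is not None else backoff_delay(attempt)
            sleep(wait)


# Demo: a fake call that is rate limited twice, then succeeds.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] == 1:
        raise RateLimited(retry_after=2.0)  # server supplied a hint
    if calls["n"] == 2:
        raise RateLimited()                 # no hint: use backoff with jitter
    return "ok"

slept = []
result = call_with_retries(flaky, sleep=slept.append)  # record waits instead of sleeping
print(result, slept)
```

Passing sleep and rng as parameters keeps the function testable without real delays. A 503 could be handled by the same loop with a separate exception type so the two cases can be counted and alerted on separately.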

Architectural moves that prevent 429s

  • Concurrency cap on the client side. A semaphore limiting in-flight calls prevents thundering herds.
  • Queue with controlled drain. A worker pool draws from a queue at a rate just under your TPM ceiling. Smooth instead of spiky.
  • Token-aware throttling. Estimate input tokens before sending; reserve TPM in your queue accordingly.
  • Separate accounts for prod vs dev. A dev typing wildly should never share quota with paying customers.
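  • Multiple keys / regions. For very large workloads, spreading load across keys or geographic endpoints can add headroom, but only where the provider enforces limits per key or per region. Many enforce them per account or organization, in which case extra keys share the same quota.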
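
The first move, a client-side concurrency cap, might look like the sketch below. It uses asyncio.Semaphore from the standard library; run_capped and fake_call are illustrative names, and the sleep stands in for a real API call.

```python
import asyncio


async def run_capped(jobs, max_in_flight):
    """Run coroutine factories with at most max_in_flight executing at once."""
    gate = asyncio.Semaphore(max_in_flight)

    async def guarded(job):
        async with gate:  # waits here while the cap is reached
            return await job()

    return await asyncio.gather(*(guarded(job) for job in jobs))


# Demo: track how many fake calls overlap.
state = {"in_flight": 0, "peak": 0}

async def fake_call():
    state["in_flight"] += 1
    state["peak"] = max(state["peak"], state["in_flight"])
    await asyncio.sleep(0.01)  # stands in for network latency
    state["in_flight"] -= 1
    return "done"

results = asyncio.run(run_capped([fake_call] * 10, max_in_flight=3))
print(len(results), state["peak"])
```

Ten jobs are submitted at once, but the peak number in flight never exceeds the cap of three. A token-aware version would also reserve estimated tokens against a TPM bucket before entering the gate.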

Tier upgrades

Most providers expose tiered limits that grow with your usage history and payment volume. New accounts start tight; sustained spend earns higher tiers automatically (or via support request). Plan for the lower starting limits at launch, and request an upgrade before you expect to need it.

Common 429 root causes

  1. One slow consumer holding many concurrent requests open. A streaming request stays in flight for its whole duration, so it counts against concurrency limits for longer.
  2. Bursty front-end behaviour — a button users tap repeatedly when the app feels slow.
  3. Missing client-side throttling during retries — the retry storm itself becomes the load.
  4. Token estimation off — sending much longer prompts than expected drains TPM faster.

In one line

RPM is your call count, TPM is your token budget — and the difference between a healthy AI app and a flaky one is usually somewhere in their queueing, retry, and backoff code.

Engr Mejba Ahmed
