AI Observability: Tracing Every Token in Production

Without traces, every LLM bug is a guess. Capture prompts, tool calls, tokens, costs, and latencies for every request — searchable, filterable, alertable.

Apr 29, 2026 · 4 min read

The flight-recorder analogy

Aircraft carry black boxes because memory alone cannot reconstruct what went wrong in flight. Pilots remember some of it, the recorder captures all of it, and the gap between the two is what makes aviation safe.

AI observability is the black box for your agent. Every prompt, tool call, model response, latency, and cost is recorded — searchable when something goes weird, alertable when something breaks. Without it, every "the agent did something strange" report is a shrug.

What to capture per request

Inputs — full prompt (system + messages + tools), request metadata.
Outputs — full response (text + tool calls + reasoning if exposed).
Tokens — input, output, cached. Per-call and aggregated.
Cost — derived from tokens × model price. Per call, per session, per user.
Latency — TTFT, total, per-tool-call, per-step in agent loops.
Model version — exact ID, including dated suffix.
Tool execution — name, args, result, duration, error.
Context — request ID, trace ID, user / tenant ID, feature flag state.

If a single field is missing, you'll wish you had it the first time something breaks.
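As a rough illustration of the field list above, a per-call record might look like the sketch below. The class names, field names, model ID, and prices are all hypothetical, and the cost formula assumes cached tokens are a subset of input tokens billed at a lower rate, which is not true for every provider.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical prices in USD per million tokens; real prices vary by model.
PRICES_PER_MTOK = {
    "example-model-2026-01-01": {"input": 3.00, "cached": 0.30, "output": 15.00},
}


@dataclass
class ToolCall:
    name: str
    args: dict
    result: Optional[str]
    duration_ms: float
    error: Optional[str] = None


@dataclass
class LLMCallRecord:
    request_id: str
    trace_id: str
    user_id: str
    model: str              # exact ID, including the dated suffix
    prompt: dict            # system + messages + tools
    response: dict          # text + tool calls (+ reasoning if exposed)
    input_tokens: int
    output_tokens: int
    cached_tokens: int = 0  # assumed to be a subset of input_tokens
    ttft_ms: float = 0.0
    total_ms: float = 0.0
    tool_calls: list = field(default_factory=list)
    feature_flags: dict = field(default_factory=dict)

    def cost_usd(self) -> float:
        """Derive cost from token counts and the assumed price table."""
        price = PRICES_PER_MTOK[self.model]
        uncached = self.input_tokens - self.cached_tokens
        return (
            uncached * price["input"]
            + self.cached_tokens * price["cached"]
            + self.output_tokens * price["output"]
        ) / 1_000_000
```

Keeping cost as a derived value rather than a stored number means a price change only touches the table, though you may still want to persist the computed figure for historical reporting.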

The trace shape

For an agent, a trace is a tree:

[ROOT request]
├── prompt cache check
├── LLM call #1 (decide)
│    └── tokens: in 12k, out 200, cost $0.04
├── tool: get_user (180ms)
├── LLM call #2 (compose)
│    └── tokens: in 12.4k, out 800, cost $0.08
└── response sent (TTFT 320ms, total 2.1s)

OpenTelemetry-compatible tracing libraries make this cheap to produce if you instrument the entry points. LangSmith, Helicone, Langfuse, Phoenix, and OpenLLMetry are popular dedicated stacks; OpenTelemetry (OTel) plus your existing APM works too.
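To make the tree shape concrete, here is a minimal standard-library sketch of nested spans. The `Span` and `Trace` classes are hypothetical stand-ins, not the API of any of the tools named above; a real tracing library would also handle IDs, export, and async or multi-threaded context, which this sketch does not.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)
    duration_ms: float = 0.0


class Trace:
    """Builds a span tree; single-threaded only in this sketch."""

    def __init__(self, root_name: str):
        self.root = Span(root_name)
        self._stack = [self.root]

    @contextmanager
    def span(self, name: str, **attributes):
        node = Span(name, dict(attributes))
        self._stack[-1].children.append(node)  # attach to the current parent
        self._stack.append(node)
        start = time.perf_counter()
        try:
            yield node
        finally:
            node.duration_ms = (time.perf_counter() - start) * 1000
            self._stack.pop()


def render(span: Span, depth: int = 0) -> list:
    """Return the tree as indented lines, one per span."""
    lines = ["  " * depth + span.name]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


trace = Trace("request")
with trace.span("LLM call #1 (decide)", input_tokens=12_000, output_tokens=200):
    pass  # the model call would go here
with trace.span("tool: get_user"):
    pass  # the tool call would go here
with trace.span("LLM call #2 (compose)", input_tokens=12_400, output_tokens=800):
    pass
```

Calling `render(trace.root)` gives one indented line per span, which is enough to eyeball whether a loop took the steps you expected.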

What you query for

"All calls in the last hour over $0.50" — runaway costs.
"Sessions with tool errors" — flaky integrations.
"P99 latency by feature" — performance regressions.
"Agent loops that exceeded 15 steps" — stuck loops.
"Calls that violated schema" — broken structured outputs.
"All calls for user X in the last 24h" — debugging individual reports.

If your tracing setup can't answer those queries in seconds, it's not yet doing its job.
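Several of the queries above can be sketched against a flat table of call records. The example uses an in-memory SQLite database purely for illustration; the table name, columns, and sample rows are assumptions, and a real trace store would have its own schema and query language.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE calls (
        request_id TEXT, user_id TEXT, feature TEXT, ts REAL,
        cost_usd REAL, latency_ms REAL, steps INTEGER,
        tool_error INTEGER, schema_valid INTEGER
    )
    """
)
# Hypothetical sample data: (request, user, feature, timestamp, cost, latency,
# agent steps, tool error flag, schema-valid flag).
rows = [
    ("r1", "u1", "search", 1000.0, 0.04, 2100.0, 2, 0, 1),
    ("r2", "u2", "search", 1010.0, 0.72, 9400.0, 18, 0, 1),
    ("r3", "u1", "summarize", 1020.0, 0.08, 3100.0, 3, 1, 0),
]
conn.executemany("INSERT INTO calls VALUES (?,?,?,?,?,?,?,?,?)", rows)


def ids(sql: str, params: tuple = ()) -> list:
    """Run a query and return the request IDs, oldest first."""
    return [row[0] for row in conn.execute(sql + " ORDER BY ts", params)]


since = 0.0  # in practice: time.time() - 3600 for "the last hour"

expensive = ids("SELECT request_id FROM calls WHERE ts >= ? AND cost_usd > 0.50", (since,))
tool_errors = ids("SELECT request_id FROM calls WHERE tool_error = 1")
stuck_loops = ids("SELECT request_id FROM calls WHERE steps > 15")
schema_violations = ids("SELECT request_id FROM calls WHERE schema_valid = 0")
for_user = ids("SELECT request_id FROM calls WHERE user_id = ?", ("u1",))
```

Each question maps to a one-line filter only because the metadata was captured as separate columns at write time; the same questions against free-text logs would need parsing first.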

Sampling tradeoffs

Capturing everything is expensive (storage, vendor cost). Capturing only sampled traffic loses signal. The pragmatic middle:

100% capture of metadata (tokens, latency, cost, errors).
100% capture of failed/slow/over-budget requests.
Sampled capture (1–10%) of full prompt + response bodies.
Privacy filter — drop or hash PII before storage.
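One way to read the four rules above as code: metadata is always stored, bodies are stored for interesting requests plus a random sample, and anything stored passes through a redaction step first. The thresholds, field names, and the email-only redaction below are illustrative assumptions; real PII filtering needs to cover far more than email addresses.

```python
import hashlib
import hmac
import random
import re

SLOW_MS = 5_000.0      # assumed latency threshold
BUDGET_USD = 0.50      # assumed per-request budget
HASH_KEY = b"replace-with-a-real-secret"  # hypothetical key for keyed hashing
METADATA_FIELDS = (
    "request_id", "input_tokens", "output_tokens", "latency_ms", "cost_usd", "error",
)
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(text: str) -> str:
    """Replace each email address with a short keyed hash."""
    def _hash(match):
        digest = hmac.new(HASH_KEY, match.group(0).encode("utf-8"), hashlib.sha256)
        return f"<email:{digest.hexdigest()[:12]}>"
    return EMAIL_RE.sub(_hash, text)


def keep_body(call: dict, sample_rate: float = 0.05, rng=random.random) -> bool:
    """Always keep failed, slow, or over-budget requests; sample the rest."""
    if call["error"] is not None:
        return True
    if call["latency_ms"] > SLOW_MS or call["cost_usd"] > BUDGET_USD:
        return True
    return rng() < sample_rate


def to_stored(call: dict, sample_rate: float = 0.05, rng=random.random) -> dict:
    """Build the record that is actually written to storage."""
    record = {name: call[name] for name in METADATA_FIELDS}  # 100% of metadata
    if keep_body(call, sample_rate, rng):
        record["prompt"] = redact(call["prompt"])
        record["response"] = redact(call["response"])
    return record
```

The `rng` parameter exists only so the sampling decision can be made deterministic in tests. A keyed hash keeps the same address mapping to the same token, so you can still group traces by user without storing the address itself.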

Alerts that earn their place

Cost spike — feature blew past its budget in 5 minutes.
Schema invalid rate — structured-output regressions after a prompt change.
Tool error rate — a downstream API is flaky.
Refusal rate — alignment regression after a model update.
Latency P99 — slow tail growing.

Skip alerts on absolute volume — those become noise. Alert on shape changes (rate, ratio, tail).
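A shape-based alert compares a rate in the current window against a baseline rather than counting events. The sketch below is one simple way to do that; the function name, the doubling factor, the 1% floor, and the minimum sample size are all assumed defaults to tune, not recommended values.

```python
def rate(events: int, total: int) -> float:
    """Fraction of requests that had the event; 0.0 when there is no traffic."""
    return events / total if total else 0.0


def should_alert(
    current_events: int,
    current_total: int,
    baseline_events: int,
    baseline_total: int,
    factor: float = 2.0,   # alert when the rate at least doubles
    floor: float = 0.01,   # ignore rates below 1% even if they doubled
    min_total: int = 50,   # too little traffic to judge
) -> bool:
    if current_total < min_total:
        return False
    current = rate(current_events, current_total)
    baseline = rate(baseline_events, baseline_total)
    return current > max(baseline * factor, floor)


# Tool error rate jumps from 2% to 10%: alert.
print(should_alert(10, 100, 20, 1000))
# Traffic doubles but the error rate stays at 2%: no alert.
print(should_alert(40, 2000, 20, 1000))
```

The second call is the point of the design: twice as many errors arrive, but because the ratio is unchanged the alert stays quiet.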

What observability prevents

Silent quality regressions after a model version change.
Cost runaways that surface only on the monthly bill.
Mystery user reports — "it didn't work last Tuesday." With traces, "last Tuesday" is searchable.
Mis-attributed failures — was it the model, the tool, the prompt, the input? Traces tell you.

Common gaps

Capturing input but not the system prompt. The system prompt is the configuration; without it, you cannot reproduce the call that produced a bad answer.
No request ID propagation. Hard to tie an LLM call to the user-facing request that produced it.
No cost-per-feature view. You see total spend but cannot tell which feature is the leak.
PII in traces. A privacy incident waiting to happen. Filter at capture.
Storage forever. 90-day retention is usually plenty; longer becomes a liability.
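The request ID gap in particular is cheap to close. A minimal sketch, assuming a Python service and using the standard-library `contextvars` module: the ID is set once at the entry point and every LLM call recorded underneath picks it up without it being passed as an argument. The function names, the in-memory `TRACE_LOG`, and the model name are hypothetical.

```python
import contextvars
import uuid

# Holds the ID of the user-facing request currently being handled.
request_id_var = contextvars.ContextVar("request_id", default=None)

TRACE_LOG = []  # stand-in for a real trace store


def record_llm_call(model: str, system_prompt: str, messages: list) -> None:
    """Record one LLM call, tagged with the current request ID."""
    TRACE_LOG.append(
        {
            "request_id": request_id_var.get(),
            "model": model,
            "system_prompt": system_prompt,  # captured, not just the user input
            "messages": messages,
        }
    )


def answer(question: str) -> str:
    """A two-step flow; the model calls themselves are omitted."""
    record_llm_call("example-model", "You are a helpful assistant.", [question])
    record_llm_call("example-model", "You are a helpful assistant.", [question, "draft"])
    return "ok"


def handle_request(question: str) -> str:
    """Entry point: assign a request ID, then run the flow."""
    token = request_id_var.set(uuid.uuid4().hex)
    try:
        return answer(question)
    finally:
        request_id_var.reset(token)  # do not leak the ID into the next request
```

Because context variables are copied per asyncio task, the same pattern generally carries over to async handlers, though work handed to other threads or processes needs the ID passed explicitly.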

In one line

AI observability turns "the model did something weird" from a story into a query — and that single capability is the difference between feature-shipping and incident-firefighting.

From the field

The observability regret is always the same: a user reports a bad answer and you can't reproduce it, because you logged the result but not the exact prompt, model, and retrieved context that produced it. So from line one of any LLM feature I capture the full request — final prompt, model and version, token counts, latency, and for RAG which chunks were retrieved — because that's the minimum to debug a "why did it say that?" ticket. And I trace the whole chain, not just the last call; in a multi-step flow the failure is usually three steps upstream. Capture first, dashboard later — you can't backfill data you never recorded.
