Speech-to-Text: From Sound Waves to Sentences

Modern ASR is typically one large neural network: audio in, text out. What used to be a chain of separately tuned stages is now usually a single Transformer-based model.

Apr 29, 2026 · 3 min read

The ear analogy

A human ear does not return phonemes or word boundaries. It just turns pressure waves into nerve signals. The brain — using context, expectations, lip-reading, prior conversation — turns those signals into "she said 'meet me at noon.'" The brain handles every step in one tangled process.

Modern automatic speech recognition (ASR) works the same way. One neural network ingests audio and outputs text. No phoneme dictionary, no separate language model — the network has internalised it all.

How a modern ASR pipeline looks

Audio in — waveform sampled at 16 kHz (the standard for speech).
Spectrogram / mel features — slice the audio into ~25 ms windows, compute frequency content, build a 2D image of "energy at each frequency over time." This is what the model actually sees.
Encoder — a Transformer (or Conformer, an audio-tweaked variant) processes the mel features.
Decoder / CTC head — emits text tokens conditioned on the encoder output.

Whisper and similar models are essentially an encoder over mel features, a text decoder, and a lot of training data.
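The framing step above can be sketched in a few lines of plain Python. This is only an illustration of the windowing arithmetic and the mel scale, not a full mel spectrogram (a real front end would also apply a window function, an FFT, and a mel filterbank). The 10 ms hop, the HTK-style mel formula, and the function names are assumptions chosen for the example; the source only specifies 16 kHz audio and ~25 ms windows.

```python
import math

def frame_signal(samples, sample_rate=16_000, win_ms=25.0, hop_ms=10.0):
    """Split a mono waveform into overlapping fixed-length windows."""
    win = int(sample_rate * win_ms / 1000)  # 400 samples at 16 kHz
    hop = int(sample_rate * hop_ms / 1000)  # 160 samples at 16 kHz (assumed hop)
    if len(samples) < win:
        return []
    n_frames = 1 + (len(samples) - win) // hop
    return [samples[i * hop : i * hop + win] for i in range(n_frames)]

def hz_to_mel(hz):
    """HTK-style mel scale: roughly linear below 1 kHz, logarithmic above."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def log_energy(frame):
    """Log of the mean squared amplitude of one frame (toy per-frame feature)."""
    power = sum(x * x for x in frame) / len(frame)
    return math.log(power + 1e-10)

# One second of a 440 Hz tone as toy input.
tone = [math.sin(2 * math.pi * 440 * n / 16_000) for n in range(16_000)]
frames = frame_signal(tone)
energies = [log_energy(f) for f in frames]
```

With these assumed settings, one second of audio becomes a sequence of roughly a hundred frames, which is the time axis of the 2D "image" the encoder consumes.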

Why one big model beat the old stack

Pre-2020 ASR was a chain: feature extraction → acoustic model (sounds → phonemes) → pronunciation lexicon (phonemes → words) → language model (words → sentence). Each stage had its own training, hyperparameters, and failure modes. Errors compounded.

End-to-end neural ASR fuses the acoustic model, lexicon, and language model into one network trained jointly on (audio, transcript) pairs. Everything after feature extraction is learned. Errors can no longer compound across hand-coded interfaces between stages because those interfaces are gone.

Three things that make or break quality

Domain match — a model trained on conversational English will struggle on medical dictation, courtroom audio, or accented speech it never saw. Domain fine-tuning matters more than model size.
Audio quality — noise, low bitrate, far-field microphones, two speakers overlapping. The model is blamed; the audio is the problem.
Streaming vs offline — offline can use full bidirectional context (high accuracy). Streaming has to commit to words as they come (lower accuracy, lower latency). The trade-off is real.

The metrics

WER (word error rate) — (substitutions + deletions + insertions) ÷ number of words in the reference transcript. Because insertions count, WER can exceed 100%. 5% on clean read speech is easy; 15% on noisy real-world audio is good; 25%+ is common on hard domains.
Real-time factor (RTF) — processing time / audio duration. <1 means faster than real-time. Streaming systems target RTF ≪ 1 with low first-token latency.
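Both metrics are simple enough to compute by hand. The sketch below calculates WER as a word-level edit distance divided by the reference length, and RTF as a plain ratio. It assumes whitespace tokenisation and no text normalisation (casing, punctuation), which real evaluations usually apply first; the function names are illustrative.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference prefix seen so far
    # (initially empty) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)

def real_time_factor(processing_seconds, audio_seconds):
    """Processing time divided by audio duration; below 1 is faster than real time."""
    return processing_seconds / audio_seconds

wer = word_error_rate("meet me at noon", "meet me noon")
rtf = real_time_factor(3.0, 60.0)
```

Here the hypothesis drops one of four reference words, so the WER is 0.25, and transcribing 60 seconds of audio in 3 seconds gives an RTF of 0.05.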

Beyond plain transcription

Modern ASR systems often bundle:

Speaker diarisation — "who said what."
Punctuation and casing — adds the dots and commas the raw model lacks.
Timestamps — per-word or per-sentence; essential for subtitling and search.
Language ID — auto-detect the language before transcribing.
Code-switching handling — multilingual mid-sentence (still hard for most systems).

Practical engineering

Resample to 16 kHz mono before sending to the model, even if your source is 48 kHz stereo. You save bandwidth, and for models trained on 16 kHz input the transcription accuracy is effectively unchanged.
Voice activity detection (VAD) strips silence so you don't pay for empty audio. Easy 2–5× cost savings on real call recordings.
Chunk long audio with overlap so you don't drop words at the boundaries.
Post-process — domain-specific find/replace ("fdr" → "Federal Reserve"), regex-based confidence flags, custom punctuation rules.
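The audio-side steps in this list (silence stripping and overlapped chunking) can be sketched as follows. The VAD here is a toy energy threshold, standing in for the model-based detectors used in production, and the frame length, threshold, and function names are assumptions for illustration only.

```python
def energy_vad(samples, frame_len, threshold):
    """Keep only frames whose mean squared amplitude exceeds the threshold."""
    kept = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / len(frame)
        if energy > threshold:
            kept.extend(frame)
    return kept

def chunk_with_overlap(n_samples, chunk_len, overlap):
    """Return (start, end) sample ranges covering [0, n_samples) with overlap."""
    if not 0 <= overlap < chunk_len:
        raise ValueError("overlap must be non-negative and smaller than chunk_len")
    if n_samples <= 0:
        return []
    step = chunk_len - overlap
    spans = []
    start = 0
    while True:
        end = min(start + chunk_len, n_samples)
        spans.append((start, end))
        if end >= n_samples:
            break
        start += step
    return spans

# Toy signal: silence, a burst of "speech", silence.
signal = [0.0] * 100 + [0.5] * 100 + [0.0] * 100
voiced = energy_vad(signal, frame_len=50, threshold=0.01)
spans = chunk_with_overlap(100, chunk_len=40, overlap=10)
```

Each chunk shares its last 10 samples with the start of the next one, so a word that straddles a boundary appears whole in at least one chunk. Merging the overlapping transcripts afterwards is a separate step not shown here.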

The model gives you a transcript. The pipeline around it makes that transcript usable.
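As one example of that surrounding pipeline, a domain find/replace pass might look like the sketch below. The glossary entry is the one from the text; the function name and the choice of case-insensitive, whole-word matching are assumptions, and a real deployment would likely need more careful rules.

```python
import re

def apply_glossary(text, glossary):
    """Replace whole-word glossary keys (case-insensitive) with their expansions."""
    if not glossary:
        return text
    # Longest keys first so a longer term wins over a shorter one it contains.
    keys = sorted(glossary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b",
        flags=re.IGNORECASE,
    )
    lowered = {k.lower(): v for k, v in glossary.items()}
    return pattern.sub(lambda m: lowered[m.group(0).lower()], text)

glossary = {"fdr": "Federal Reserve"}
fixed = apply_glossary("the fdr raised rates", glossary)
```

Matching on word boundaries keeps the replacement from firing inside longer tokens that merely contain the abbreviation.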

From the field

Two cheap moves decide whether a speech feature is good and affordable, and neither is the model. First, run voice-activity detection to strip silence before you send audio — on real call recordings that alone has cut transcription cost several-fold because you stop paying to transcribe dead air. Second, accept that domain mismatch beats model size: a top model that never heard your jargon, accents, or product names will still botch them, and a little domain-specific post-processing — a find-replace dictionary, custom punctuation — fixes more real errors than upgrading the model. The transcript is the easy part; the pipeline around it is where usable quality comes from.
