The ear analogy
A human ear does not return phonemes or word boundaries. It just turns pressure waves into nerve signals. The brain — using context, expectations, lip-reading, prior conversation — turns those signals into "she said 'meet me at noon.'" The brain handles every step in one tangled process.
Modern automatic speech recognition (ASR) works in much the same way. One neural network ingests audio and outputs text. There is no phoneme dictionary and, in most systems, no separate language model: the network has internalised both.
How a modern ASR pipeline looks
- Audio in — waveform sampled at 16 kHz (the standard for speech).
- Spectrogram / mel features — slice the audio into ~25 ms windows, compute frequency content, build a 2D image of "energy at each frequency over time." This is what the model actually sees.
- Encoder — a Transformer (or Conformer, a Transformer augmented with convolution layers for audio) processes the mel features.
- Decoder / CTC head — emits text tokens conditioned on the encoder output.
Whisper and similar models are basically: mel-encoder + text-decoder + a lot of data.
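The first two steps can be sketched in plain Python. This is a minimal illustration, not how any particular model computes its features: the 10 ms hop, the Hann window, the naive DFT, and the HTK-style mel formula are all assumptions chosen for readability, and the function names are hypothetical. A real front end would use an FFT library and a full mel filterbank.

```python
import cmath
import math

SAMPLE_RATE = 16_000  # Hz, as in the pipeline above
WINDOW_MS = 25        # analysis window length from the text
HOP_MS = 10           # assumed step between windows (a common choice)


def frame_signal(samples, sample_rate=SAMPLE_RATE, window_ms=WINDOW_MS, hop_ms=HOP_MS):
    """Split a list of samples into overlapping fixed-length frames."""
    window = sample_rate * window_ms // 1000  # 400 samples at 16 kHz
    hop = sample_rate * hop_ms // 1000        # 160 samples at 16 kHz
    return [samples[start:start + window]
            for start in range(0, len(samples) - window + 1, hop)]


def power_spectrum(frame):
    """Hann-window one frame and return the power in each non-negative frequency bin."""
    n = len(frame)
    windowed = [s * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        # Naive DFT: fine for a demo, far too slow for production use.
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def hz_to_mel(hz):
    """Convert a frequency in Hz to the mel scale (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)
```

At 16 kHz a 25 ms window holds 400 samples, so each frequency bin is 40 Hz wide and a 1 kHz tone peaks in bin 25. Stacking one such spectrum per frame, after pooling bins into mel bands, gives the 2D "image" the encoder consumes.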
Why one big model beat the old stack
Until around 2020, most production ASR was a chain: feature extraction → acoustic model (sounds → phonemes) → pronunciation lexicon (phonemes → words) → language model (words → sentence). Each stage had its own training, hyperparameters, and failure modes. Errors compounded, because a mistake made in one stage was handed to the next as if it were correct.
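A rough way to see the compounding is to multiply per-stage accuracies. The sketch below assumes the stages fail independently, which real systems do not strictly do, and the 95% figure is invented for the arithmetic rather than measured anywhere.

```python
import math


def chain_accuracy(stage_accuracies):
    """Probability that every stage is right, assuming independent failures."""
    return math.prod(stage_accuracies)


# Hypothetical figures: acoustic model, lexicon, and language model
# are each right 95% of the time.
print(round(chain_accuracy([0.95, 0.95, 0.95]), 3))  # prints 0.857
```

Three stages that each look respectable on their own already lose about one item in seven end to end under this simplified model.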
End-to-end neural ASR fuses these stages into one network trained jointly on (audio, transcript) pairs. Apart from the fixed mel front end, every stage is now learned. Errors no longer propagate through hand-coded interfaces between stages, because those interfaces are gone.
Three things that make or break quality
- Domain match — a model trained on conversational English will struggle on medical dictation, courtroom audio, or accented speech it never saw. Domain fine-tuning often matters more than model size.
- Audio quality — noise, low bitrate, far-field microphones, overlapping speakers. The model often gets the blame when the audio is the real problem.
- Streaming vs offline — offline can use full bidirectional context (high accuracy). Streaming has to commit to words as they come (lower accuracy, lower latency). The trade-off is real.
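The streaming trade-off comes down to how much of the utterance the model may look at before committing to output. A minimal sketch, assuming a hypothetical `lookahead` measured in encoder frames (real systems express this through attention masks or chunk sizes):

```python
def visible_frames(t, n_frames, lookahead=None):
    """Return the range of frames usable when emitting output for frame t.

    lookahead=None models offline decoding (the whole utterance is visible);
    an integer models streaming with that many future frames allowed.
    """
    if lookahead is None:
        return range(0, n_frames)
    return range(0, min(n_frames, t + lookahead + 1))
```

With a 2-frame lookahead, the output for frame 3 of a 10-frame utterance can draw on frames 0 to 5 only, while the offline model sees all 10. A larger lookahead recovers accuracy at the price of latency.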
The metrics
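- WER (word error rate) — the number of word errors (substitutions + deletions + insertions) divided by the number of words in the reference transcript. Because insertions count as errors, WER can exceed 100%. Around 5% on clean read speech is easy; 15% on noisy real-world audio is good; 25%+ is common on hard domains.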
- Real-time factor (RTF) — processing time / audio duration. <1 means faster than real-time. Streaming systems target RTF ≪ 1 with low first-token latency.
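Both metrics are short to compute. The sketch below uses word-level edit distance for WER; it splits on whitespace and does no text normalisation (casing, punctuation, number formatting), which real scoring tools usually apply first. The function names are illustrative.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,                          # deletion
                curr[j - 1] + 1,                      # insertion
                prev[j - 1] + (ref_word != hyp_word)  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds, audio_seconds):
    """RTF below 1 means the system runs faster than real time."""
    return processing_seconds / audio_seconds
```

For example, `word_error_rate("meet me at noon", "meet me at moon")` returns 0.25 (one substitution in four reference words), and transcribing a 60-second clip in 6 seconds gives an RTF of 0.1.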
Beyond plain transcription
Modern ASR systems often bundle:
- Speaker diarisation — "who said what."
- Punctuation and casing — adds the full stops, commas, and capital letters that the raw output of many models lacks.
- Timestamps — per-word or per-sentence; essential for subtitling and search.
- Language ID — auto-detect the language before transcribing.
- Code-switching handling — speakers switching languages mid-sentence (still hard for most systems).
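As one illustration of why timestamps matter for subtitling, per-word timestamps can be grouped into SubRip (.srt) cues. The input shape, a list of `(text, start_seconds, end_seconds)` tuples, and the fixed words-per-cue rule are assumptions for this sketch; ASR systems differ in how they expose timestamps, and real subtitle tools also split on pauses and line length.

```python
def format_srt_time(seconds):
    """Format seconds as the HH:MM:SS,mmm timestamp used in .srt files."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def words_to_srt(words, max_words=7):
    """Group (text, start_s, end_s) word tuples into numbered SRT cues."""
    cues = []
    for index, i in enumerate(range(0, len(words), max_words), start=1):
        group = words[i:i + max_words]
        start, end = group[0][1], group[-1][2]  # first word start, last word end
        text = " ".join(word[0] for word in group)
        cues.append(
            f"{index}\n{format_srt_time(start)} --> {format_srt_time(end)}\n{text}\n"
        )
    return "\n".join(cues)
```

Each cue spans from the start of its first word to the end of its last, so subtitle timing is only as good as the word timestamps feeding it.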
Practical engineering
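- Resample to 16 kHz mono before sending to the model, even if your source is 48 kHz stereo. You save bandwidth, and for a model trained on 16 kHz input the accuracy is essentially unchanged. One caveat: if the two stereo channels carry different speakers, as in many call recordings, transcribe them separately rather than mixing them down.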
- Voice activity detection (VAD) strips silence so you don't pay for empty audio. Easy 2–5× cost savings on real call recordings.
- Chunk long audio with overlap so you don't drop words at the boundaries, and de-duplicate the overlapping words when you stitch the transcripts back together.
- Post-process — domain-specific find/replace ("fdr" → "Federal Reserve"), regex-based confidence flags, custom punctuation rules.
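Three of these steps can be sketched with the standard library alone. Everything here is illustrative: the 30 s chunk and 2 s overlap, the 30 ms frame, the -40 dBFS threshold, and the function names are assumptions rather than recommendations, and a production VAD is normally a trained model, not an energy threshold. The only value taken from the text is the "fdr" replacement.

```python
import math
import re

SAMPLE_RATE = 16_000  # Hz, matching the resampling advice above


def chunk_bounds(n_samples, chunk_s=30.0, overlap_s=2.0, sample_rate=SAMPLE_RATE):
    """Return (start, end) sample indices for chunks overlapping by overlap_s."""
    chunk = int(chunk_s * sample_rate)
    step = chunk - int(overlap_s * sample_rate)
    if step <= 0:
        raise ValueError("overlap must be shorter than the chunk")
    bounds = []
    start = 0
    while start < n_samples:
        end = min(start + chunk, n_samples)
        bounds.append((start, end))
        if end == n_samples:
            break
        start += step
    return bounds


def speech_frames(samples, frame_len=480, threshold_db=-40.0):
    """Naive energy VAD: True for each frame whose RMS level exceeds threshold_db.

    Assumes float samples in [-1, 1]; frames do not overlap.
    """
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / frame_len)
        level_db = 20 * math.log10(rms) if rms > 0 else float("-inf")
        flags.append(level_db > threshold_db)
    return flags


REPLACEMENTS = {"fdr": "Federal Reserve"}  # example mapping from the text


def apply_replacements(text, replacements=REPLACEMENTS):
    """Replace whole-word matches, ignoring case."""
    for wrong, right in replacements.items():
        text = re.sub(rf"\b{re.escape(wrong)}\b", right, text, flags=re.IGNORECASE)
    return text
```

A 70-second recording yields three chunks covering 0 to 30 s, 28 to 58 s, and 56 to 70 s, so any word cut off at one boundary appears whole in the neighbouring chunk. The word-boundary anchors in `apply_replacements` keep the rule from firing inside longer words.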
The model gives you a transcript. The pipeline around it makes that transcript usable.