Speech-to-Text: From Sound Waves to Sentences

Modern ASR is one big neural network: audio in, text out. The pipeline used to be a chain of separately tuned stages; now it is typically a single Transformer-based model.

The ear analogy

A human ear does not return phonemes or word boundaries. It just turns pressure waves into nerve signals. The brain — using context, expectations, lip-reading, prior conversation — turns those signals into "she said 'meet me at noon.'" The brain handles every step in one tangled process.

Modern automatic speech recognition (ASR) works the same way. One neural network ingests audio and outputs text. No phoneme dictionary, no separate language model — the network has internalised it all.

How a modern ASR pipeline looks

  1. Audio in — waveform sampled at 16 kHz (the standard for speech).
  2. Spectrogram / mel features — slice the audio into ~25 ms windows, compute frequency content, build a 2D image of "energy at each frequency over time." This is what the model actually sees.
  3. Encoder — a Transformer (or Conformer, an audio-tweaked variant) processes the mel features.
  4. Decoder / CTC head — emits text tokens conditioned on the encoder output.

Whisper and similar models are basically: mel-encoder + text-decoder + a lot of data.
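
Step 2 of the pipeline can be sketched in plain Python. This is a minimal illustration, not how any particular model computes its features: the 10 ms hop, the Hann window, and the function names `frame_signal` and `power_spectrum` are assumptions, and the final mel filterbank (which pools the linear frequency bins into mel bands) is left out for brevity.

```python
import cmath
import math

SAMPLE_RATE = 16_000   # Hz, as in step 1
WINDOW_MS = 25         # window length from step 2
HOP_MS = 10            # assumed hop between windows; the text does not give one


def frame_signal(samples, sample_rate=SAMPLE_RATE,
                 window_ms=WINDOW_MS, hop_ms=HOP_MS):
    """Split a waveform into overlapping fixed-length frames."""
    window = sample_rate * window_ms // 1000   # 400 samples at 16 kHz
    hop = sample_rate * hop_ms // 1000         # 160 samples at 16 kHz
    return [samples[start:start + window]
            for start in range(0, len(samples) - window + 1, hop)]


def power_spectrum(frame):
    """Naive DFT power spectrum of one Hann-windowed frame (bins 0..N/2)."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spectrum.append(abs(acc) ** 2)
    return spectrum


# 100 ms of a 1 kHz sine wave as a stand-in for real speech.
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(1600)]
frames = frame_signal(tone)

# One row per frame, one column per frequency bin: the "2D image".
spectrogram = [power_spectrum(f) for f in frames]

first = spectrogram[0]
peak_bin = max(range(len(first)), key=first.__getitem__)
peak_hz = peak_bin * SAMPLE_RATE / len(frames[0])
print(len(frames), len(first), peak_hz)
```

In practice a library FFT replaces the naive DFT loop, which is only written out here to show what "compute frequency content" means for a single window.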

Why one big model beat the old stack

Until roughly 2020, mainstream ASR was a chain: feature extraction → acoustic model (sounds → phonemes) → pronunciation lexicon (phonemes → words) → language model (words → sentence). Each stage had its own training, hyperparameters, and failure modes. Errors compounded.

End-to-end neural ASR fuses these stages into one network trained jointly on (audio, transcript) pairs. Everything after feature extraction is learned together, so errors can no longer compound across hand-coded interfaces between stages: there are none.

Three things that make or break quality

  • Domain match: a model trained on conversational English will struggle on medical dictation, courtroom audio, or accented speech it never saw. Fine-tuning on in-domain data often matters more than model size.
  • Audio quality: noise, low bitrate, far-field microphones, and overlapping speakers all hurt accuracy. The model gets the blame, but the audio is often the problem.
  • Streaming vs offline: offline can use full bidirectional context (higher accuracy). Streaming has to commit to words as they arrive (lower latency, lower accuracy). The trade-off is real.

The metrics

  • WER (word error rate): (substitutions + deletions + insertions) ÷ number of words in the reference transcript. Because insertions count, WER can exceed 100%. 5% on clean read speech is easy; 15% on noisy real-world audio is good; 25%+ is common on hard domains.
  • Real-time factor (RTF): processing time ÷ audio duration. <1 means faster than real time. Streaming systems target RTF ≪ 1 with low first-token latency.
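
Both metrics are easy to compute yourself. The sketch below gets WER from a word-level edit distance and RTF from a plain division. The function names and the whitespace tokenisation are assumptions; real evaluations usually normalise casing and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / words in the reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; below 1 is faster than real time."""
    return processing_seconds / audio_seconds


# One substitution in a four-word reference.
print(word_error_rate("meet me at noon", "meet me at new"))   # 0.25

# 12 s of compute for 60 s of audio.
print(real_time_factor(12.0, 60.0))                           # 0.2
```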

Beyond plain transcription

Modern ASR systems often bundle:

  • Speaker diarisation — "who said what."
  • Punctuation and casing — adds the dots and commas the raw model lacks.
  • Timestamps — per-word or per-sentence; essential for subtitling and search.
  • Language ID — auto-detect the language before transcribing.
  • Code-switching handling — multilingual mid-sentence (still hard for most systems).

Practical engineering

  • Resample to 16 kHz mono before sending to the model, even if your source is 48 kHz stereo. You save bandwidth, and a model that expects 16 kHz input loses nothing, because the audio would be downsampled anyway.
  • Voice activity detection (VAD) strips silence so you don't pay for empty audio. Easy 2–5× cost savings on real call recordings.
  • Chunk long audio with overlap so you don't drop words at the boundaries.
  • Post-process — domain-specific find/replace ("fdr" → "Federal Reserve"), regex-based confidence flags, custom punctuation rules.
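
Chunking with overlap, from the list above, could look like the sketch below. The 30-second chunk length, the 2-second overlap, and the name `chunk_spans` are illustrative assumptions. Merging the duplicated words from the overlapping regions is a separate step that is not shown.

```python
def chunk_spans(total_samples, chunk_s=30.0, overlap_s=2.0, sample_rate=16_000):
    """Return (start, end) sample indices for overlapping chunks.

    Consecutive chunks share `overlap_s` seconds of audio, so a word cut
    at the end of one chunk appears whole at the start of the next.
    """
    if total_samples <= 0:
        return []
    chunk = int(chunk_s * sample_rate)
    step = chunk - int(overlap_s * sample_rate)
    if step <= 0:
        raise ValueError("overlap must be shorter than the chunk length")

    spans = []
    start = 0
    while True:
        end = min(start + chunk, total_samples)
        spans.append((start, end))
        if end == total_samples:
            break
        start += step
    return spans


# 70 seconds of 16 kHz audio.
spans = chunk_spans(70 * 16_000)
print([(s / 16_000, e / 16_000) for s, e in spans])
# [(0.0, 30.0), (28.0, 58.0), (56.0, 70.0)]
```

Each span can be sliced out of the resampled waveform and sent to the model independently, which also makes the work easy to parallelise.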

The model gives you a transcript. The pipeline around it makes that transcript usable.
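
As a small illustration of that surrounding pipeline, here is a sketch of the post-processing step. The replacement table reuses the "fdr" example from the list above. The repeated-word check is a hypothetical heuristic standing in for "regex-based confidence flags", not a standard technique, and both function names are assumptions.

```python
import re

# Domain-specific replacements; this single entry comes from the example above.
REPLACEMENTS = {"fdr": "Federal Reserve"}


def apply_replacements(text, replacements=REPLACEMENTS):
    """Replace whole-word, case-insensitive matches of each term."""
    for term, expansion in replacements.items():
        pattern = re.compile(rf"\b{re.escape(term)}\b", flags=re.IGNORECASE)
        # A function replacement keeps backslashes in `expansion` literal.
        text = pattern.sub(lambda _match, e=expansion: e, text)
    return text


def flag_repeated_words(text):
    """Hypothetical flag: an immediately repeated word often marks a stutter
    or a recognition glitch that a human should review."""
    return [m.group(0)
            for m in re.finditer(r"\b(\w+)\s+\1\b", text, flags=re.IGNORECASE)]


raw = "the fdr said we we expect the the same rate"
print(apply_replacements(raw))
# the Federal Reserve said we we expect the the same rate
print(flag_repeated_words(raw))
# ['we we', 'the the']
```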

Engr Mejba Ahmed
