The orchestra analogy
Strings, brass, percussion, voice — each section has its own language. A score does not turn them into one instrument. A conductor keeps them in shared time and key so the audience hears one piece, not four parallel concerts.
A multimodal model is the orchestra. Each modality has its own encoder (its own "section"). The fusion layer is the conductor: it puts every embedding into a shared representational space and lets the rest of the network reason across all of them as if they were one input.
The fusion-pattern map
Three classic places to fuse:
- Early fusion — concatenate raw features (or near-raw embeddings) at the input. The model has to learn cross-modal relationships from scratch. Powerful but data-hungry.
- Mid fusion (most modern systems) — each modality has its own encoder, then projections merge them in the language model's input stream. Plays well with pretrained components.
- Late fusion — each modality produces its own decision; a small head combines them. Easy to engineer, weak at deep cross-modal reasoning.
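Modern frontier multimodal models (GPT-4o, Gemini, Claude with vision/voice) do not publish full architectural details. Mid fusion is the pattern best documented in open vision-language models, and it is the usual reference point for how such systems are built.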
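To make the difference between mid and late fusion concrete, here is a minimal sketch in plain Python. The helper names (`project`, `mid_fusion`, `late_fusion`) and the tiny hand-written projection matrix are illustrative assumptions; in a real system the projection is learned and the vectors come from pretrained encoders.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]  # row-major: one row per output dimension


def project(vec: Vector, weight: Matrix) -> Vector:
    """Linear projection: map a modality-specific feature to the LM's width."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def mid_fusion(image_feats: List[Vector],
               text_embeds: List[Vector],
               weight: Matrix) -> List[Vector]:
    """Project image features to the text width, then place them in the
    same token sequence as the text embeddings."""
    image_tokens = [project(f, weight) for f in image_feats]
    return image_tokens + text_embeds


def late_fusion(scores: List[float], weights: List[float]) -> float:
    """Combine independent per-modality decisions with a weighted average."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


# Toy shapes: 2 image features of width 3, 2 text tokens of width 2.
image_feats = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
text_embeds = [[0.5, 0.5], [0.1, 0.9]]
weight = [[1.0, 0.0, 0.0],  # hypothetical 3 -> 2 projection (normally learned)
          [0.0, 1.0, 1.0]]

sequence = mid_fusion(image_feats, text_embeds, weight)  # 4 tokens, width 2
combined = late_fusion([0.9, 0.5], [1.0, 1.0])           # one blended score
```

With mid fusion the downstream network sees image and text tokens side by side and can attend across them. With late fusion it only ever sees two finished scores, which is why deep cross-modal reasoning is hard there.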
What "shared space" actually means
A picture of a dog and the word "dog" should land near each other in the model's representation. A picture of a violin and the word "dog" should land far apart. This is cross-modal alignment.
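"Near" and "far" are usually measured with cosine similarity. A minimal sketch, assuming hand-made toy vectors in place of real encoder outputs (the vectors and the `nearest` helper are illustrative, not taken from any model):

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query: Vector, candidates: Dict[str, Vector]) -> str:
    """Return the name of the candidate most similar to the query."""
    return max(candidates, key=lambda name: cosine(query, candidates[name]))


# Toy stand-ins for encoder outputs (assumed values, not real embeddings).
text_dog = [0.9, 0.1, 0.0]
images = {
    "dog_photo": [0.8, 0.2, 0.1],
    "violin_photo": [0.1, 0.1, 0.9],
}

best = nearest(text_dog, images)
```

In an aligned space the text vector for "dog" scores much higher against the dog photo than against the violin photo, so retrieval reduces to a nearest-neighbor lookup.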
The training trick that delivers it (CLIP-style):
- Pull pairs (image, caption) from the web.
- Encode each side; train so matched pairs are close, mismatched pairs are far (contrastive loss).
- After training, the spaces are aligned enough that "show me images near this caption" works.
Audio-text alignment uses similar tricks (CLAP, AudioCLIP, transcript-paired audio).
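The contrastive objective can be sketched in a few lines. This is a simplified symmetric cross-entropy over a similarity matrix, written in plain Python for readability. The function names and the fixed `temperature=0.07` are assumptions for illustration; real training code runs on batched tensors and typically learns the temperature.

```python
import math
from typing import List

Vector = List[float]


def normalize(v: Vector) -> Vector:
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diagonal(logits: List[List[float]]) -> float:
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(image_embs: List[Vector],
                     text_embs: List[Vector],
                     temperature: float = 0.07) -> float:
    """Symmetric contrastive loss: pair i of images matches pair i of texts."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    # logits[i][j] = similarity of image i and text j, sharpened by temperature
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    transposed = [list(col) for col in zip(*logits)]
    image_to_text = cross_entropy_diagonal(logits)
    text_to_image = cross_entropy_diagonal(transposed)
    return 0.5 * (image_to_text + text_to_image)


# Matched pairs on the diagonal give a low loss; swapped captions give a high one.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Every other item in the batch acts as a negative example, which is why large batches of web pairs work well with this objective.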
Three serious challenges
- Token economy. Images can cost hundreds to thousands of tokens; audio can be similarly heavy. Multimodal context windows are dominated by non-text modalities. Engineer for it.
- Modality dominance. During training, the loss can come overwhelmingly from the easy modality (usually text), and the model under-uses the harder one (image). Loss weighting and curriculum sampling fight this.
- Evaluation. Single-modality benchmarks miss the point. You need multimodal evals: VQA, audio-grounded reasoning, video QA, document understanding.
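A back-of-the-envelope estimate shows why the token economy bites. The sketch below assumes a ViT-style encoder that emits one token per square patch, with a patch size of 14 pixels. Both are assumptions: actual tokenization varies by model and is often followed by resizing, tiling, or resampling.

```python
import math


def image_tokens(width: int, height: int, patch: int = 14) -> int:
    """Tokens for one image if every patch x patch tile becomes one token."""
    return math.ceil(width / patch) * math.ceil(height / patch)


def context_share(n_images: int, width: int, height: int,
                  text_tokens: int, patch: int = 14) -> float:
    """Fraction of the prompt taken up by image tokens."""
    img = n_images * image_tokens(width, height, patch)
    return img / (img + text_tokens)


per_image = image_tokens(336, 336)          # 24 * 24 patches
share = context_share(4, 336, 336, 500)     # four images plus 500 text tokens
```

Under these assumptions a single 336x336 image costs 576 tokens, and four of them next to a 500-token text prompt take up over 80% of the input.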
Where multimodal is actually shipping
- UI agents: see a screen, decide a click.
- Document AI: parse a PDF that has tables, charts, and text.
- Real-time voice assistants: audio in, audio out, with a text language model in the middle (or end-to-end speech-to-speech).
- Robotics: vision, proprioception, and language instructions in a single policy.
- Accessibility: describe images for blind users, transcribe and translate speech in real time.
What to think about as a builder
- Input compression matters more than model size. Half the fight is getting the modality into a sensible token budget.
- Provenance and citation are harder across modalities. "Where in the image / audio is the answer grounded?" is a real, often unsolved, requirement.
- Mixed-modality output is still rare and rough. Most production systems take many modalities in but emit text only. Image or audio output exists, but it mostly comes from separate generative models attached to the conversation.
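One simple form of input compression is to average-pool neighboring patch embeddings before they reach the language model. The sketch below is one possible approach under stated assumptions, not a description of any particular system. `pool_patches` is a hypothetical helper, and production models often use learned resamplers instead.

```python
from typing import List

Vector = List[float]
Grid = List[List[Vector]]  # rows x cols x embedding dim


def pool_patches(grid: Grid, factor: int = 2) -> Grid:
    """Average each factor x factor block of patch embeddings into one token.

    Trailing rows or columns that do not fill a complete block are dropped.
    """
    rows, cols = len(grid), len(grid[0])
    dim = len(grid[0][0])
    pooled: Grid = []
    for r in range(0, rows - rows % factor, factor):
        out_row = []
        for c in range(0, cols - cols % factor, factor):
            block = [grid[r + dr][c + dc]
                     for dr in range(factor) for dc in range(factor)]
            out_row.append([sum(v[d] for v in block) / len(block)
                            for d in range(dim)])
        pooled.append(out_row)
    return pooled


# Toy 4x4 grid of 1-dimensional "embeddings" holding the values 0..15.
grid = [[[float(r * 4 + c)] for c in range(4)] for r in range(4)]
pooled = pool_patches(grid, factor=2)

tokens_before = len(grid) * len(grid[0])      # 16
tokens_after = len(pooled) * len(pooled[0])   # 4
```

A 2x2 pool cuts the token count by a factor of four at the cost of spatial detail. Whether that trade is acceptable depends on the task: fine for scene description, risky for small text in documents.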
Multimodal is not a different paradigm. It is the same Transformer with extra encoders and a richer loss. The breakthroughs are in data curation and engineering — not in architecture.