# Beyond Text: The Multi-Modal Revolution
Modern LLMs can see images, hear audio, and process multiple data types simultaneously. Understanding multi-modal capabilities unlocks entirely new categories of AI applications.
## What Are Multi-Modal Models?
Multi-modal models process and generate multiple types of data:
| Modality | Input Example | Output Example |
|---|---|---|
| Text | Questions, instructions | Answers, summaries |
| Images | Photos, screenshots, charts | Image descriptions, analysis |
| Audio | Speech, music | Transcription, voice synthesis |
| Video | Video clips | Frame analysis, descriptions |
| Code | Source files | Bug fixes, explanations |
## Vision Capabilities with GPT-4o
```python
from openai import OpenAI
import base64
import mimetypes

client = OpenAI()

def analyze_image(image_path, question):
    """Send an image to GPT-4o for analysis."""
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode()
    # Derive the media type from the file extension so PNGs, GIFs, etc.
    # are labeled correctly in the data URL (not hardcoded as JPEG)
    media_type = mimetypes.guess_type(image_path)[0] or "image/jpeg"
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {
                    "url": f"data:{media_type};base64,{image_data}",
                    "detail": "high"  # request full-resolution analysis
                }}
            ]
        }],
        max_tokens=1000
    )
    return response.choices[0].message.content

# Analyze a chart
result = analyze_image("sales_chart.png", "What trends do you see in this chart?")
print(result)

# Read a receipt
result = analyze_image("receipt.jpg", "Extract all items and prices as JSON")
print(result)
```
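The receipt prompt asks for JSON, but models often wrap their answer in a markdown fence or add surrounding commentary, so parsing the raw response with `json.loads` alone tends to fail. A minimal sketch of a tolerant parser (the `extract_json` helper is my own, not part of either SDK):

```python
import json
import re

def extract_json(response_text):
    """Pull the first JSON object or array out of a model response,
    tolerating markdown code fences and surrounding prose."""
    # Strip a ```json ... ``` fence if one is present
    fenced = re.search(r"```(?:json)?\s*(.*?)```", response_text, re.DOTALL)
    candidate = fenced.group(1) if fenced else response_text
    # Fall back to the first {...} or [...] span in the text
    match = re.search(r"(\{.*\}|\[.*\])", candidate, re.DOTALL)
    if not match:
        raise ValueError("No JSON found in response")
    return json.loads(match.group(1))

# Example: a typical fenced model reply
reply = '```json\n{"items": [{"name": "Coffee", "price": 3.50}]}\n```'
data = extract_json(reply)
print(data["items"][0]["name"])  # Coffee
```

For production use, prompting the model with an explicit schema (or using a structured-output API mode where available) is more reliable than regex cleanup alone.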
## Vision with Claude
```python
import base64
import mimetypes

from anthropic import Anthropic

anthropic = Anthropic()

def analyze_with_claude(image_path, question):
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode()
    media_type = mimetypes.guess_type(image_path)[0] or "image/jpeg"
    response = anthropic.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {
                    "type": "base64",
                    "media_type": media_type,
                    "data": image_data
                }},
                {"type": "text", "text": question}
            ]
        }]
    )
    return response.content[0].text
```
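Image uploads that are too large or in an unsupported format fail only after a round trip to the API. A cheap local pre-check avoids wasting the call; this sketch assumes the commonly documented limits (roughly 5 MB per image via the API, with JPEG/PNG/GIF/WebP supported — verify against the current Anthropic docs), and the `validate_image` helper name is my own:

```python
import os

# Assumed limits -- confirm against current API documentation
MAX_IMAGE_BYTES = 5 * 1024 * 1024
SUPPORTED_TYPES = {"image/jpeg", "image/png", "image/gif", "image/webp"}

def validate_image(image_path, media_type):
    """Run cheap local checks before spending an API call on a doomed request.

    Returns (ok, reason): ok is True when the image passes both checks.
    """
    size = os.path.getsize(image_path)
    if size > MAX_IMAGE_BYTES:
        return False, f"Image is {size} bytes; limit is {MAX_IMAGE_BYTES}"
    if media_type not in SUPPORTED_TYPES:
        return False, f"Unsupported media type: {media_type}"
    return True, "ok"
```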
## Audio Transcription with Whisper
```python
# OpenAI Whisper API (reuses the `client` from the GPT-4o example)
def transcribe_audio(audio_path):
    with open(audio_path, "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            response_format="verbose_json",
            timestamp_granularities=["segment"]
        )
    return transcript

result = transcribe_audio("meeting.mp3")
for segment in result.segments:
    print(f"[{segment.start:.1f}s] {segment.text}")
```
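The segment timestamps map naturally onto subtitle formats. A minimal sketch of SRT conversion (the `to_srt` helper is my own, assuming each segment exposes `.start`, `.end`, and `.text` as in the `verbose_json` response above):

```python
def to_srt(segments):
    """Convert Whisper-style segments (objects with .start, .end, .text,
    in seconds) into SRT subtitle format."""
    def timestamp(seconds):
        # SRT uses HH:MM:SS,mmm
        ms = int(round(seconds * 1000))
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1_000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{timestamp(seg.start)} --> {timestamp(seg.end)}\n{seg.text.strip()}\n"
        )
    return "\n".join(blocks)
```

This keeps the conversion independent of the API client, so the same helper works on cached transcription results.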
## Practical Multi-Modal Use Cases
- Customer Support — User sends photo of damaged product, AI analyzes and generates claim
- Document Processing — Extract data from invoices, contracts, handwritten notes
- Accessibility — Describe images for visually impaired users
- Quality Control — Analyze product photos for defects
- Education — Students upload homework photos, AI provides feedback
## Key Takeaway
Multi-modal AI transforms what applications can do. The ability to process images, audio, and text together enables use cases that were out of reach only a few years ago. Master these APIs to build next-generation products.