Understanding Multi-Modal Models — Text, Image, Audio

Chapter 2 Multi-Modal Chatbot — LLMs, Gradio UI, and Function Calling

20 min read · Lesson 5 / 40

Beyond Text: The Multi-Modal Revolution

Modern LLMs can see images, hear audio, and process multiple data types simultaneously. Understanding multi-modal capabilities unlocks entirely new categories of AI applications.

What Are Multi-Modal Models?

Multi-modal models process and generate multiple types of data:

| Modality | Input Example | Output Example |
|----------|---------------|----------------|
| Text | Questions, instructions | Answers, summaries |
| Images | Photos, screenshots, charts | Image descriptions, analysis |
| Audio | Speech, music | Transcription, voice synthesis |
| Video | Video clips | Frame analysis, descriptions |
| Code | Source files | Bug fixes, explanations |
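
All of these modalities travel through the same chat interface: a single user message whose content is a list of typed parts. As a minimal sketch (using the OpenAI-style content-part format shown later in this lesson; `build_user_message` is a hypothetical helper, not part of any SDK), combining text and an image looks like this:

```python
import base64

def build_user_message(question, image_bytes=None, media_type="image/png"):
    """Assemble one OpenAI-style multi-modal user message.

    The text part is always included; an image, if provided, rides along
    as a base64 data URL in the same message's content list.
    """
    content = [{"type": "text", "text": question}]
    if image_bytes is not None:
        encoded = base64.b64encode(image_bytes).decode()
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:{media_type};base64,{encoded}"},
        })
    return {"role": "user", "content": content}

# One message, two modalities
message = build_user_message("Describe this chart", image_bytes=b"fake image bytes")
```

The key design point: modalities are not separate API calls but sibling entries in one `content` list, so the model reasons over them together.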

Vision Capabilities with GPT-4o

from openai import OpenAI
import base64
import mimetypes

client = OpenAI()

def analyze_image(image_path, question):
    """Send an image to GPT-4o for analysis."""
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode()

    # Match the data-URL media type to the actual file (PNG vs. JPEG)
    media_type = mimetypes.guess_type(image_path)[0] or "image/jpeg"

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {
                    "url": f"data:{media_type};base64,{image_data}",
                    "detail": "high"  # "high" for fine detail, "low" for lower cost
                }}
            ]
        }],
        max_tokens=1000
    )
    return response.choices[0].message.content

# Analyze a chart
result = analyze_image("sales_chart.png", "What trends do you see in this chart?")
print(result)

# Read a receipt
result = analyze_image("receipt.jpg", "Extract all items and prices as JSON")
print(result)

Vision with Claude

from anthropic import Anthropic
import base64
import mimetypes

anthropic = Anthropic()

def analyze_with_claude(image_path, question):
    """Send an image to Claude for analysis."""
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode()

    # Claude requires the image's actual media type
    media_type = mimetypes.guess_type(image_path)[0]

    response = anthropic.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {
                    "type": "base64",
                    "media_type": media_type,
                    "data": image_data
                }},
                {"type": "text", "text": question}
            ]
        }]
    )
    return response.content[0].text
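
The Anthropic API accepts JPEG, PNG, GIF, and WebP images; sending anything else fails at the API level. A small pre-flight check catches this before you pay for a request (`check_image_type` is a hypothetical helper for this lesson, not part of the SDK):

```python
import mimetypes

# Image types the Anthropic API accepts, per its documentation
CLAUDE_IMAGE_TYPES = {"image/jpeg", "image/png", "image/gif", "image/webp"}

def check_image_type(image_path):
    """Return the file's media type, or raise if Claude won't accept it."""
    media_type = mimetypes.guess_type(image_path)[0]
    if media_type not in CLAUDE_IMAGE_TYPES:
        raise ValueError(f"Unsupported image type for Claude: {media_type}")
    return media_type

check_image_type("photo.png")   # returns "image/png"
# check_image_type("scan.tiff") would raise ValueError
```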

Audio Transcription with Whisper

# OpenAI Whisper API
def transcribe_audio(audio_path):
    with open(audio_path, "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            response_format="verbose_json",
            timestamp_granularities=["segment"]
        )
    return transcript

result = transcribe_audio("meeting.mp3")
for segment in result.segments:
    print(f"[{segment.start:.1f}s] {segment.text}")
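
Because `verbose_json` returns start and end times in seconds for each segment, the transcript converts naturally into subtitles. A sketch of that conversion (these helpers are hypothetical and operate on plain `(start, end, text)` tuples rather than the SDK's segment objects):

```python
def format_timestamp(seconds):
    """Render seconds as an SRT-style HH:MM:SS,mmm timestamp."""
    ms = round(seconds * 1000)
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

def segments_to_srt(segments):
    """Turn (start, end, text) tuples into an SRT subtitle string."""
    blocks = []
    for i, (start, end, text) in enumerate(segments, 1):
        blocks.append(
            f"{i}\n{format_timestamp(start)} --> {format_timestamp(end)}\n{text.strip()}\n"
        )
    return "\n".join(blocks)

srt = segments_to_srt([(0.0, 2.5, "Welcome to the meeting."),
                       (2.5, 5.0, "Let's review last week.")])
```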

Practical Multi-Modal Use Cases

  1. Customer Support — User sends photo of damaged product, AI analyzes and generates claim
  2. Document Processing — Extract data from invoices, contracts, handwritten notes
  3. Accessibility — Describe images for visually impaired users
  4. Quality Control — Analyze product photos for defects
  5. Education — Students upload homework photos, AI provides feedback
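
For the document-processing case (like the receipt example earlier), a practical wrinkle is that models asked for JSON often wrap it in a markdown code fence or add surrounding prose. A small, hedged sketch of cleaning the reply before parsing (`extract_json` is a hypothetical helper; it handles fenced or bare JSON, not arbitrary prose):

```python
import json
import re

def extract_json(reply):
    """Pull a JSON payload out of a model reply, tolerating ```json fences."""
    fenced = re.search(r"```(?:json)?\s*(.*?)```", reply, re.DOTALL)
    candidate = fenced.group(1) if fenced else reply
    return json.loads(candidate.strip())

reply = '```json\n{"items": [{"name": "Coffee", "price": 3.5}]}\n```'
data = extract_json(reply)  # {"items": [{"name": "Coffee", "price": 3.5}]}
```

Asking the model for JSON and parsing defensively like this is usually more robust than assuming the reply is bare, valid JSON.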

Key Takeaway

Multi-modal AI transforms what applications can do. The ability to process images, audio, and text together enables use cases that were impossible just two years ago. Master these APIs to build next-generation products.