Understanding Multi-Modal Models — Text, Image, Audio

Chapter 2 Multi-Modal Chatbot — LLMs, Gradio UI, and Function Calling

Beyond Text: The Multi-Modal Revolution

Many modern LLMs accept images and audio alongside text, and some can handle several of these inputs in a single request. Understanding these multi-modal capabilities opens up new categories of AI applications.

What Are Multi-Modal Models?

Multi-modal models process and generate multiple types of data:

| Modality | Input Example               | Output Example               |
|----------|-----------------------------|------------------------------|
| Text     | Questions, instructions     | Answers, summaries           |
| Images   | Photos, screenshots, charts | Image descriptions, analysis |
| Audio    | Speech, music               | Transcription, voice synthesis |
| Video    | Video clips                 | Frame analysis, descriptions |
| Code     | Source files                | Bug fixes, explanations      |
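
As a rough illustration of the table above, an application that accepts uploads first has to decide which modality a file belongs to before choosing an API. The sketch below does this from the file extension using Python's standard `mimetypes` module. The function name `detect_modality` and the labels it returns are assumptions made for this example, not part of any SDK.

```python
import mimetypes


def detect_modality(path: str) -> str:
    """Return a coarse modality label for a file, based on its extension."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None:
        return "unknown"
    # MIME types look like "image/png"; the part before "/" is the broad category.
    top_level = mime.split("/")[0]
    if top_level in {"image", "audio", "video", "text"}:
        return top_level
    return "unknown"


for name in ["sales_chart.png", "meeting.mp3", "notes.txt"]:
    print(name, "->", detect_modality(name))
```

Extension-based detection is only a first guess. A production system would likely also inspect the file contents before trusting the label.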

Vision Capabilities with GPT-4o

from openai import OpenAI
import base64
import mimetypes

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def analyze_image(image_path, question):
    """Send an image to GPT-4o for analysis."""
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode()

    # Match the data URL's MIME type to the file (the examples below send a PNG and a JPEG)
    media_type = mimetypes.guess_type(image_path)[0] or "image/jpeg"

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {
                    "url": f"data:{media_type};base64,{image_data}",
                    "detail": "high"  # request high-detail image processing
                }}
            ]
        }],
        max_tokens=1000
    )
    return response.choices[0].message.content

# Analyze a chart
result = analyze_image("sales_chart.png", "What trends do you see in this chart?")
print(result)

# Read a receipt
result = analyze_image("receipt.jpg", "Extract all items and prices as JSON")
print(result)
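
The receipt example above asks for JSON, but the reply comes back as plain text and models sometimes wrap it in a Markdown code fence. The following is a minimal sketch of a parser that tolerates that wrapping. The helper name `parse_json_reply` is hypothetical, and the sketch assumes the reply contains a single JSON document. It raises `json.JSONDecodeError` otherwise.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence, built here to avoid nesting fences in this example


def parse_json_reply(reply: str):
    """Parse JSON from a model reply, tolerating an optional Markdown code fence."""
    text = reply.strip()
    # Look for a fenced block, optionally labelled "json", and keep only its body.
    pattern = FENCE + r"(?:json)?\s*(.*?)" + FENCE
    fenced = re.search(pattern, text, flags=re.DOTALL)
    if fenced:
        text = fenced.group(1).strip()
    return json.loads(text)


# Stand-in reply for illustration; a real reply would come from analyze_image(...)
sample = FENCE + 'json\n{"items": [{"name": "Coffee", "price": 3.5}]}\n' + FENCE
print(parse_json_reply(sample)["items"][0]["name"])
```

Parsing the text into a Python object early makes malformed replies fail loudly instead of propagating into later steps.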

Vision with Claude

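import base64
import mimetypes

from anthropic import Anthropic

anthropic = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def analyze_with_claude(image_path, question):
    """Send an image to Claude for analysis."""
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode()

    # Claude needs the actual media type (e.g. image/png); fall back to JPEG if it cannot be guessed
    media_type = mimetypes.guess_type(image_path)[0] or "image/jpeg"
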
    response = anthropic.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {
                    "type": "base64",
                    "media_type": media_type,
                    "data": image_data
                }},
                {"type": "text", "text": question}
            ]
        }]
    )
    return response.content[0].text
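
The image content block in the Claude request can be built and validated before any network call is made. The sketch below assumes the four image media types the Claude Messages API documents as supported (JPEG, PNG, GIF, WebP). The helper name `build_image_block` and the constant `SUPPORTED_MEDIA_TYPES` are introduced here for illustration only.

```python
import base64
import mimetypes

# Assumption: image media types accepted by the Claude Messages API at the time of writing.
SUPPORTED_MEDIA_TYPES = {"image/jpeg", "image/png", "image/gif", "image/webp"}


def build_image_block(image_path: str) -> dict:
    """Return a base64 image content block, rejecting unsupported file types early."""
    media_type = mimetypes.guess_type(image_path)[0]
    if media_type not in SUPPORTED_MEDIA_TYPES:
        raise ValueError(f"Unsupported or unknown image type: {media_type!r}")
    with open(image_path, "rb") as f:
        data = base64.b64encode(f.read()).decode()
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": media_type, "data": data},
    }
```

The returned dictionary has the same shape as the first item in the `content` list above, so it could be placed there directly, followed by the text block with the question.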

Audio Transcription with Whisper

# OpenAI Whisper API (uses the OpenAI client created in the GPT-4o example above)
def transcribe_audio(audio_path):
    with open(audio_path, "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            response_format="verbose_json",
            timestamp_granularities=["segment"]
        )
    return transcript

result = transcribe_audio("meeting.mp3")
for segment in result.segments:
    print(f"[{segment.start:.1f}s] {segment.text}")
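
For longer recordings, raw second offsets such as `75.4s` are harder to scan than clock-style timestamps. The sketch below formats segments as `HH:MM:SS` lines. The helpers `format_timestamp` and `format_segments` are hypothetical names, and the `SimpleNamespace` objects are stand-in data that mimic only the `start` and `text` attributes used in the loop above.

```python
from types import SimpleNamespace


def format_timestamp(seconds: float) -> str:
    """Convert a second offset to HH:MM:SS, dropping fractional seconds."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def format_segments(segments) -> list:
    """Return one '[HH:MM:SS] text' line per transcript segment."""
    return [f"[{format_timestamp(s.start)}] {s.text.strip()}" for s in segments]


# Stand-in segments for illustration; real ones would come from transcribe_audio(...)
demo_segments = [
    SimpleNamespace(start=0.0, text=" Welcome everyone."),
    SimpleNamespace(start=75.4, text=" Let's review the agenda."),
]
for line in format_segments(demo_segments):
    print(line)
```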

Practical Multi-Modal Use Cases

  1. Customer Support — User sends photo of damaged product, AI analyzes and generates claim
  2. Document Processing — Extract data from invoices, contracts, handwritten notes
  3. Accessibility — Describe images for visually impaired users
  4. Quality Control — Analyze product photos for defects
  5. Education — Students upload homework photos, AI provides feedback

Key Takeaway

Multi-modal AI expands what applications can do. Processing images, audio, and text together enables use cases that text-only models could not handle, such as reading a receipt from a photo or transcribing a meeting. Learn these APIs to build products on top of them.

Engr Mejba Ahmed
