Multi-modal inputs allow you to pass various types of content—text, images, videos, audio files, and documents—to agents in a consistent, type-safe format. The MessageContent type provides a unified interface for encoding different media types, whether you’re executing agents or using chat completions.

MessageContent Overview

MessageContent is a Pydantic model that encapsulates different types of input content with validation. It supports six content types:
Type | Description | Use Case
text | Plain text content | Instructions, questions, or text-based prompts
image_url | Image from a public URL | Images hosted on the web or cloud storage
video_url | Video from a public URL | Videos hosted on the web or cloud storage
audio_url | Audio from a public URL | Audio files hosted on the web or cloud storage
file_url | Generic file from a public URL | Documents (PDFs, Word docs, etc.) or other file types
input_file | File uploaded via the Files API | Files uploaded to VLM Run storage using file IDs
Import MessageContent and related types from the SDK:
from vlmrun.types import MessageContent, ImageUrl, VideoUrl, AudioUrl, FileUrl

Input Types

Text Input

Use text for plain text instructions, questions, or prompts:
from vlmrun.types import MessageContent

# Simple text input
text_content = MessageContent(type="text", text="Analyze this image and describe what you see")

Image Input

Use image_url for images accessible via HTTP/HTTPS URLs. The ImageUrl type supports an optional detail parameter to control image processing quality:
from vlmrun.types import MessageContent, ImageUrl

# Image with default detail level (auto)
image_content = MessageContent(
    type="image_url",
    image_url=ImageUrl(url="https://example.com/photo.jpg")
)

# Image with high detail for better quality processing
image_content = MessageContent(
    type="image_url",
    image_url=ImageUrl(url="https://example.com/photo.jpg", detail="high")
)
The detail parameter accepts:
  • "auto" (default): Automatically determines the appropriate detail level
  • "low": Lower resolution, faster processing
  • "high": Higher resolution, more detailed analysis

Video Input

Use video_url for videos accessible via HTTP/HTTPS URLs:
from vlmrun.types import MessageContent, VideoUrl

video_content = MessageContent(
    type="video_url",
    video_url=VideoUrl(url="https://example.com/video.mp4")
)

Audio Input

Use audio_url for audio files accessible via HTTP/HTTPS URLs:
from vlmrun.types import MessageContent, AudioUrl

audio_content = MessageContent(
    type="audio_url",
    audio_url=AudioUrl(url="https://example.com/audio.mp3")
)

Document / File Input (URL)

Use file_url for documents and other file types accessible via HTTP/HTTPS URLs:
from vlmrun.types import MessageContent, FileUrl

document_content = MessageContent(
    type="file_url",
    file_url=FileUrl(url="https://example.com/document.pdf")
)

Document / File Input (Upload)

Use input_file with a file ID for files uploaded via the Files API. This is the recommended approach for files you want to manage through VLM Run’s file storage:
from vlmrun.types import MessageContent
from vlmrun.client import VLMRun
from pathlib import Path

client = VLMRun(base_url="https://agent.vlm.run/v1", api_key="<VLMRUN_API_KEY>")

# Step 1: Upload the file
file_response = client.files.upload(file=Path("local_image.jpg"))

# Step 2: Use the file ID in MessageContent
file_content = MessageContent(
    type="input_file",
    file_id=file_response.id
)
When using input_file, you can provide either file_id (from Files API upload) or file_url (presigned URL or public URL). The SDK automatically handles file retrieval and processing.
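
To make the either/or rule concrete, here is a minimal sketch of the two input_file variants as plain payload dicts. The field names mirror the typed examples above, but the literal values (the file ID, the URL) and the serialized shape are illustrative assumptions, not the documented wire format:

```python
# Hypothetical payloads for the two input_file variants.
# "file_abc123" is a made-up file ID for illustration only.
by_file_id = {"type": "input_file", "file_id": "file_abc123"}                     # from a Files API upload
by_file_url = {"type": "input_file", "file_url": "https://example.com/doc.pdf"}   # presigned or public URL

def pick_source(content: dict) -> str:
    """Return which retrieval path applies, mirroring the either/or rule above."""
    if content.get("file_id"):
        return "files-api"
    if content.get("file_url"):
        return "url"
    raise ValueError("input_file requires file_id or file_url")
```

In practice you would construct these via MessageContent as shown above; the dict form is only meant to show that exactly one of the two fields needs to be set.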

Using Multi-modal Inputs

Agents can accept multiple inputs of different types. Define each input as a separate field in your input model:
from pydantic import BaseModel, Field
from vlmrun.types import MessageContent, ImageUrl, FileUrl

class MultiModalInputs(BaseModel):
    image: MessageContent = Field(..., description="The reference image")
    document: MessageContent = Field(..., description="The document to process")
    instruction: MessageContent = Field(..., description="Processing instructions")

inputs = MultiModalInputs(
    image=MessageContent(
        type="image_url",
        image_url=ImageUrl(url="https://example.com/reference.jpg")
    ),
    document=MessageContent(
        type="file_url",
        file_url=FileUrl(url="https://example.com/document.pdf")
    ),
    instruction=MessageContent(
        type="text",
        text="Extract information matching the reference image format"
    )
)

In an Agent Execution

When executing agents, define typed input models with MessageContent fields to get type safety and validation:
from pydantic import BaseModel, Field
from vlmrun.client import VLMRun
from vlmrun.client.types import AgentExecutionConfig, AgentExecutionResponse
from vlmrun.types import MessageContent, ImageRef, ImageUrl

client = VLMRun(base_url="https://agent.vlm.run/v1", api_key="<VLMRUN_API_KEY>")

# Define typed inputs using MessageContent
class ExecutionInputs(BaseModel):
    image: MessageContent = Field(..., description="The input image to process")
    ref_image: MessageContent = Field(..., description="A reference image to guide processing")

class ImageResponse(BaseModel):
    image: ImageRef = Field(..., description="The processed output image")

# Execute agent with image URL
execution: AgentExecutionResponse = client.agent.execute(
    name="image/blur-image",
    inputs=ExecutionInputs(
        image=MessageContent(
            type="image_url",
            image_url=ImageUrl(url="https://example.com/photo.jpg")
        ),
        ref_image=MessageContent(
            type="image_url",
            image_url=ImageUrl(url="https://example.com/style.jpg")
        )
    ),
    config=AgentExecutionConfig(
        prompt="Blur all faces in the image",
        response_model=ImageResponse
    )
)

In a Chat Completion

For chat completions, use arrays of content objects in the OpenAI-compatible format. Each message can contain multiple content items:
from vlmrun.client import VLMRun

client = VLMRun(base_url="https://agent.vlm.run/v1", api_key="<VLMRUN_API_KEY>")

response = client.agent.completions.create(
    model="vlmrun-orion-1:auto",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Blur all the faces in this image"},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg", "detail": "auto"}}
            ]
        }
    ]
)

Best Practices

When working with multi-modal inputs, follow these guidelines:
  • Use input_file for production: Upload files via the Files API and use file_id for better security, access control, and file management. URL-based inputs are convenient for development and testing.
  • Specify image detail levels: Use detail="high" for images requiring fine-grained analysis (e.g., medical imaging, document OCR). Use detail="low" for faster processing when high detail isn’t needed.
  • Validate URLs before use: Ensure all URLs are publicly accessible and use HTTPS when possible. The SDK validates URL format but cannot verify accessibility.
  • Use typed input models: Define Pydantic models for agent execution inputs to leverage type checking, IDE autocompletion, and automatic validation.
  • Handle large files appropriately: For large videos or documents, prefer uploading via the Files API rather than using public URLs, as the Files API provides better error handling and progress tracking.
  • Combine text with media: Always include text instructions alongside media inputs to provide context and specify the desired operation.
For chat completions, you can mix text and media in a single message’s content array. This allows you to provide both instructions and the media to process in one request.

URL Validation

All URL-based input types (image_url, video_url, audio_url, file_url) require valid HTTP or HTTPS URLs. The SDK automatically validates URLs:
from vlmrun.types import ImageUrl

# Valid - HTTP URL
image_url = ImageUrl(url="http://example.com/image.jpg")

# Valid - HTTPS URL
image_url = ImageUrl(url="https://example.com/image.jpg")

# Invalid - Will raise ValueError
image_url = ImageUrl(url="file:///local/path/image.jpg")  # ❌ Not HTTP/HTTPS
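
The check boils down to the URL scheme. Here is a minimal stdlib sketch of the same rule, as an illustration of the behavior rather than the SDK's actual validator:

```python
from urllib.parse import urlparse

def is_http_url(url: str) -> bool:
    # Accept only http/https schemes, matching the validation rule above.
    return urlparse(url).scheme in ("http", "https")

print(is_http_url("https://example.com/image.jpg"))  # True
print(is_http_url("file:///local/path/image.jpg"))   # False
```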

Common Use Cases

Image Analysis

Process images with text instructions for classification, object detection, or transformation.

Document Processing

Extract structured data from PDFs, Word documents, and other file formats.

Video Analysis

Analyze video content for transcription, scene detection, or frame extraction.

Multi-modal Tasks

Combine multiple input types (text, images, documents) for complex processing workflows.

Artifacts

Learn how to retrieve generated artifacts from agent responses

Agent Execution

Execute agents with multi-modal inputs and retrieve structured results

Agent Creation

Create reusable agents that accept multi-modal inputs