Multi-modal inputs allow you to pass various types of content—text, images, videos, audio files, and documents—to agents in a consistent, type-safe format. The MessageContent type provides a unified interface for encoding different media types, whether you’re executing agents or using chat completions.

MessageContent Overview

MessageContent is a Pydantic model that encapsulates different types of input content with validation. It supports six content types:
Type | Description | Use Case
text | Plain text content | Instructions, questions, or text-based prompts
image_url | Image from a public URL | Images hosted on the web or cloud storage
video_url | Video from a public URL | Videos hosted on the web or cloud storage
audio_url | Audio from a public URL | Audio files hosted on the web or cloud storage
file_url | Generic file from a public URL | Documents (PDFs, Word docs, etc.) or other file types
input_file | File uploaded via the Files API | Files uploaded to VLM Run storage using file IDs
Import MessageContent and related types from the SDK:
from vlmrun.types import MessageContent, ImageUrl, VideoUrl, AudioUrl, FileUrl

Input Types

Text Input

Use text for plain text instructions, questions, or prompts:
from vlmrun.types import MessageContent

# Simple text input
text_content = MessageContent(type="text", text="Analyze this image and describe what you see")

Image Input

Use image_url for images accessible via HTTP/HTTPS URLs. The ImageUrl type supports an optional detail parameter to control image processing quality:
from vlmrun.types import MessageContent, ImageUrl

# Image with default detail level (auto)
image_content = MessageContent(
    type="image_url",
    image_url=ImageUrl(url="https://example.com/photo.jpg")
)

# Image with high detail for better quality processing
image_content = MessageContent(
    type="image_url",
    image_url=ImageUrl(url="https://example.com/photo.jpg", detail="high")
)
The detail parameter accepts:
  • "auto" (default): Automatically determines the appropriate detail level
  • "low": Lower resolution, faster processing
  • "high": Higher resolution, more detailed analysis

Video Input

Use video_url for videos accessible via HTTP/HTTPS URLs:
from vlmrun.types import MessageContent, VideoUrl

video_content = MessageContent(
    type="video_url",
    video_url=VideoUrl(url="https://example.com/video.mp4")
)

Audio Input

Use audio_url for audio files accessible via HTTP/HTTPS URLs:
from vlmrun.types import MessageContent, AudioUrl

audio_content = MessageContent(
    type="audio_url",
    audio_url=AudioUrl(url="https://example.com/audio.mp3")
)

Document / File Input (URL)

Use file_url for documents and other file types accessible via HTTP/HTTPS URLs:
from vlmrun.types import MessageContent, FileUrl

document_content = MessageContent(
    type="file_url",
    file_url=FileUrl(url="https://example.com/document.pdf")
)

Document / File Input (Upload)

Use input_file with a file ID for files uploaded via the Files API. This is the recommended approach for files you want to manage through VLM Run’s file storage:
from vlmrun.types import MessageContent
from vlmrun.client import VLMRun
from pathlib import Path

client = VLMRun(base_url="https://agent.vlm.run/v1", api_key="<VLMRUN_API_KEY>")

# Step 1: Upload the file
file_response = client.files.upload(file=Path("local_image.jpg"))

# Step 2: Use the file ID in MessageContent
file_content = MessageContent(
    type="input_file",
    file_id=file_response.id
)
When using input_file, you can provide either file_id (from Files API upload) or file_url (presigned URL or public URL). The SDK automatically handles file retrieval and processing.
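
To make the either/or rule concrete, here is a minimal sketch of the two input_file variants as plain payload dicts. The field names mirror the typed examples above, but the literal values (the file ID, the URL) and the serialized shape are illustrative assumptions, not the documented wire format:

```python
# Hypothetical payloads for the two input_file variants.
# "file_abc123" is a made-up file ID for illustration only.
by_file_id = {"type": "input_file", "file_id": "file_abc123"}                     # from a Files API upload
by_file_url = {"type": "input_file", "file_url": "https://example.com/doc.pdf"}   # presigned or public URL

def pick_source(content: dict) -> str:
    """Return which retrieval path applies, mirroring the either/or rule above."""
    if content.get("file_id"):
        return "files-api"
    if content.get("file_url"):
        return "url"
    raise ValueError("input_file requires file_id or file_url")
```

In practice you would construct these via MessageContent as shown above; the dict form is only meant to show that exactly one of the two fields needs to be set.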

Using Multi-modal Inputs

Agents can accept multiple inputs of different types. Define each input as a separate field in your input model:
from pydantic import BaseModel, Field
from vlmrun.types import MessageContent, ImageUrl, FileUrl

class MultiModalInputs(BaseModel):
    image: MessageContent = Field(..., description="The reference image")
    document: MessageContent = Field(..., description="The document to process")
    instruction: MessageContent = Field(..., description="Processing instructions")

inputs = MultiModalInputs(
    image=MessageContent(
        type="image_url",
        image_url=ImageUrl(url="https://example.com/reference.jpg")
    ),
    document=MessageContent(
        type="file_url",
        file_url=FileUrl(url="https://example.com/document.pdf")
    ),
    instruction=MessageContent(
        type="text",
        text="Extract information matching the reference image format"
    )
)

In an Agent Execution

When executing agents, define typed input models with MessageContent fields to get type safety and validation:
from pydantic import BaseModel, Field
from vlmrun.client import VLMRun
from vlmrun.client.types import AgentExecutionConfig, AgentExecutionResponse
from vlmrun.types import MessageContent, ImageRef, ImageUrl

client = VLMRun(base_url="https://agent.vlm.run/v1", api_key="<VLMRUN_API_KEY>")

# Define typed inputs using MessageContent
class ExecutionInputs(BaseModel):
    image: MessageContent = Field(..., description="The input image to process")
    ref_image: MessageContent = Field(..., description="A reference image to guide processing")

class ImageResponse(BaseModel):
    image: ImageRef = Field(..., description="The processed output image")

# Execute agent with image URL
execution: AgentExecutionResponse = client.agent.execute(
    name="image/blur-image",
    inputs=ExecutionInputs(
        image=MessageContent(
            type="image_url",
            image_url=ImageUrl(url="https://example.com/photo.jpg")
        ),
        ref_image=MessageContent(
            type="image_url",
            image_url=ImageUrl(url="https://example.com/style.jpg")
        )
    ),
    config=AgentExecutionConfig(
        prompt="Blur all faces in the image",
        response_model=ImageResponse
    )
)

In a Chat Completion

For chat completions, use arrays of content objects in the OpenAI-compatible format. Each message can contain multiple content items:
from vlmrun.client import VLMRun

client = VLMRun(base_url="https://agent.vlm.run/v1", api_key="<VLMRUN_API_KEY>")

response = client.agent.completions.create(
    model="vlmrun-orion-1:auto",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Blur all the faces in this image"},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg", "detail": "auto"}}
            ]
        }
    ]
)

Best Practices

When working with multi-modal inputs, follow these guidelines:
  • Use input_file for production: Upload files via the Files API and use file_id for better security, access control, and file management. URL-based inputs are convenient for development and testing.
  • Specify image detail levels: Use detail="high" for images requiring fine-grained analysis (e.g., medical imaging, document OCR). Use detail="low" for faster processing when high detail isn’t needed.
  • Validate URLs before use: Ensure all URLs are publicly accessible and use HTTPS when possible. The SDK validates URL format but cannot verify accessibility.
  • Use typed input models: Define Pydantic models for agent execution inputs to leverage type checking, IDE autocompletion, and automatic validation.
  • Handle large files appropriately: For large videos or documents, prefer uploading via the Files API rather than using public URLs, as the Files API provides better error handling and progress tracking.
  • Combine text with media: Always include text instructions alongside media inputs to provide context and specify the desired operation.
For chat completions, you can mix text and media in a single message’s content array. This allows you to provide both instructions and the media to process in one request.

URL Validation

All URL-based input types (image_url, video_url, audio_url, file_url) require valid HTTP or HTTPS URLs. The SDK automatically validates URLs:
from vlmrun.types import ImageUrl

# Valid - HTTP URL
image_url = ImageUrl(url="http://example.com/image.jpg")

# Valid - HTTPS URL
image_url = ImageUrl(url="https://example.com/image.jpg")

# Invalid - Will raise ValueError
image_url = ImageUrl(url="file:///local/path/image.jpg")  # ❌ Not HTTP/HTTPS
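
The check boils down to the URL scheme. Here is a minimal stdlib sketch of the same rule, as an illustration of the behavior rather than the SDK's actual validator:

```python
from urllib.parse import urlparse

def is_http_url(url: str) -> bool:
    # Accept only http/https schemes, matching the validation rule above.
    return urlparse(url).scheme in ("http", "https")

print(is_http_url("https://example.com/image.jpg"))  # True
print(is_http_url("file:///local/path/image.jpg"))   # False
```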

Common Use Cases

Image Analysis

Process images with text instructions for classification, object detection, or transformation.

Document Processing

Extract structured data from PDFs, Word documents, and other file formats.

Video Analysis

Analyze video content for transcription, scene detection, or frame extraction.

Multi-modal Tasks

Combine multiple input types (text, images, documents) for complex processing workflows.

Artifacts

Learn how to retrieve generated artifacts from agent responses

Agent Execution

Execute agents with multi-modal inputs and retrieve structured results

Agent Creation

Create reusable agents that accept multi-modal inputs