The `MessageContent` type provides a unified interface for encoding different media types, whether you're executing agents or using chat completions.
## MessageContent Overview

`MessageContent` is a Pydantic model that encapsulates different types of input content with validation. It supports six content types:
| Type | Description | Use Case |
|---|---|---|
| `text` | Plain text content | Instructions, questions, or text-based prompts |
| `image_url` | Image from a public URL | Images hosted on the web or cloud storage |
| `video_url` | Video from a public URL | Videos hosted on the web or cloud storage |
| `audio_url` | Audio from a public URL | Audio files hosted on the web or cloud storage |
| `file_url` | Generic file from a public URL | Documents (PDFs, Word docs, etc.) or other file types |
| `input_file` | File uploaded via the Files API | Files uploaded to VLM Run storage using file IDs |
`MessageContent` and the related input types are imported from the SDK.
## Input Types

### Text Input

Use `text` for plain-text instructions, questions, or prompts:
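The original code sample is not shown here, so as a sketch, this is the JSON-style dict form a text content item serializes to (field names mirror the table above; the exact SDK constructor may differ):

```python
# A text content item in its serialized dict form.
# The "type"/"text" field layout is an assumption based on the content
# types listed above, not a confirmed SDK signature.
text_content = {
    "type": "text",
    "text": "Summarize the attached document in three bullet points.",
}

print(text_content["type"])  # -> text
```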
### Image Input

Use `image_url` for images accessible via HTTP/HTTPS URLs. The `ImageUrl` type supports an optional `detail` parameter to control image processing quality.

The `detail` parameter accepts:

- `"auto"` (default): automatically determines the appropriate detail level
- `"low"`: lower resolution, faster processing
- `"high"`: higher resolution, more detailed analysis
### Video Input

Use `video_url` for videos accessible via HTTP/HTTPS URLs:
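A sketch of the serialized form, following the same layout as the other URL types (field names are assumptions):

```python
# A video_url content item pointing at a publicly hosted video.
video_content = {
    "type": "video_url",
    "video_url": {"url": "https://example.com/demo.mp4"},
}
```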
### Audio Input

Use `audio_url` for audio files accessible via HTTP/HTTPS URLs:
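Again as a sketch of the serialized form (field names are assumptions consistent with the other URL types):

```python
# An audio_url content item pointing at a publicly hosted audio file.
audio_content = {
    "type": "audio_url",
    "audio_url": {"url": "https://example.com/meeting.mp3"},
}
```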
### Document / File Input (URL)

Use `file_url` for documents and other file types accessible via HTTP/HTTPS URLs:
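A sketch of the serialized form for a URL-based document (field names are assumptions consistent with the other URL types):

```python
# A file_url content item for a publicly hosted document.
file_content = {
    "type": "file_url",
    "file_url": {"url": "https://example.com/contract.pdf"},
}
```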
### Document / File Input (Upload)

Use `input_file` with a file ID for files uploaded via the Files API. This is the recommended approach for files you want to manage through VLM Run's file storage:
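A sketch of the serialized form, referencing a previously uploaded file by ID (the key name and ID format shown here are hypothetical):

```python
# An input_file content item referencing a file uploaded via the Files API.
# "file_abc123" is a placeholder for the ID returned at upload time.
uploaded_content = {
    "type": "input_file",
    "input_file": {"file_id": "file_abc123"},
}
```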
## Using Multi-modal Inputs

Agents can accept multiple inputs of different types. Define each input as a separate field in your input model:
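A rough sketch of the one-field-per-input pattern, using stdlib dataclasses as a stand-in for the Pydantic models the SDK actually provides (class and field names here are illustrative, not the SDK's):

```python
from dataclasses import dataclass

# Stand-ins for the SDK's content types; in practice these would be
# Pydantic models such as MessageContent with built-in validation.
@dataclass
class TextContent:
    text: str

@dataclass
class ImageContent:
    url: str
    detail: str = "auto"

# An agent input model: one field per input - a text instruction
# plus two images to compare.
@dataclass
class CompareImagesInput:
    instruction: TextContent
    before: ImageContent
    after: ImageContent

payload = CompareImagesInput(
    instruction=TextContent(text="Describe what changed between the images."),
    before=ImageContent(url="https://example.com/before.png", detail="high"),
    after=ImageContent(url="https://example.com/after.png", detail="high"),
)
print(payload.before.detail)  # -> high
```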
### In an Agent Execution

When executing agents, define typed and compound input models using `MessageContent` for type safety and validation:
### In a Chat Completion

For chat completions, use arrays of content objects in the OpenAI-compatible format. Each message can contain multiple content items:
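For example, a single user message can pair a text instruction with an image, expressed as a content array in the OpenAI-compatible format:

```python
# One user message with two content items: a text question and an image.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is shown in this image?"},
            {
                "type": "image_url",
                "image_url": {"url": "https://example.com/chart.png", "detail": "low"},
            },
        ],
    }
]

assert len(messages[0]["content"]) == 2
```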
## Toolset Selection

The `toolsets` parameter lets you explicitly specify which tool categories the agent should use when processing your request. This gives you fine-grained control over the agent's capabilities and can improve performance by limiting the tools to only those needed for your task.
### Available Tool Categories
| Category | Description |
|---|---|
| `core` | Essential tools for analyzing images, extracting content from documents, and processing video - the fundamental capabilities for most tasks |
| `image` | Comprehensive image understanding including object detection, text recognition, UI element detection, segmentation, and visual quality assessment |
| `image-gen` | Create new images from text descriptions, transform existing images, and apply visual effects like blurring regions |
| `world_gen` | Generate 3D models from images, including object-level reconstruction and full scene reconstruction |
| `viz` | Annotate images with bounding boxes, keypoints, and segmentation masks for visual output |
| `document` | Process documents with layout detection, text extraction, content parsing, and structured data extraction |
| `video` | Video processing capabilities including frame sampling, trimming, segmentation, and video generation |
| `web` | Search the web for information to augment agent responses with real-time data |
### Usage Examples
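As a sketch, `toolsets` might be supplied alongside the messages in a request body; the surrounding client call and the exact parameter placement are assumptions:

```python
# Restrict the agent to the core toolset plus document tools,
# e.g. for an invoice-extraction task.
request_body = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract the line items from this invoice."},
                {"type": "file_url", "file_url": {"url": "https://example.com/invoice.pdf"}},
            ],
        }
    ],
    "toolsets": ["core", "document"],  # only these categories are enabled
}
```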
## Best Practices

When working with multi-modal inputs, follow these guidelines:

- **Use `input_file` for production**: Upload files via the Files API and use `file_id` for better security, access control, and file management. URL-based inputs are convenient for development and testing.
- **Specify image detail levels**: Use `detail="high"` for images requiring fine-grained analysis (e.g., medical imaging, document OCR). Use `detail="low"` for faster processing when high detail isn't needed.
- **Validate URLs before use**: Ensure all URLs are publicly accessible and use HTTPS when possible. The SDK validates URL format but cannot verify accessibility.
- **Use typed input models**: Define Pydantic models for agent execution inputs to leverage type checking, IDE autocompletion, and automatic validation.
- **Handle large files appropriately**: For large videos or documents, prefer uploading via the Files API rather than using public URLs, as the Files API provides better error handling and progress tracking.
- **Combine text with media**: Always include text instructions alongside media inputs to provide context and specify the desired operation.
## URL Validation

All URL-based input types (`image_url`, `video_url`, `audio_url`, `file_url`) require valid HTTP or HTTPS URLs. The SDK automatically validates URL format:
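The check is format-only, along these lines (a stdlib sketch of the behavior described above, not the SDK's actual validator):

```python
from urllib.parse import urlparse

def is_valid_media_url(url: str) -> bool:
    """Check that a URL is well-formed HTTP or HTTPS.

    Mirrors the format-only validation described above: a passing URL
    is syntactically valid, but may still be unreachable.
    """
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

print(is_valid_media_url("https://example.com/photo.jpg"))  # -> True
print(is_valid_media_url("ftp://example.com/photo.jpg"))    # -> False
```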
## Common Use Cases

### Image Analysis
Process images with text instructions for classification, object detection, or transformation.
### Document Processing
Extract structured data from PDFs, Word documents, and other file formats.
### Video Analysis
Analyze video content for transcription, scene detection, or frame extraction.
### Multi-modal Tasks
Combine multiple input types (text, images, documents) for complex processing workflows.
## Related Documentation

- **Artifacts**: Learn how to retrieve generated artifacts from agent responses
- **Agent Execution**: Execute agents with multi-modal inputs and retrieve structured results
- **Agent Creation**: Create reusable agents that accept multi-modal inputs