Encode images, videos, documents, and other media in a consistent format for agent execution and chat completions
Multi-modal inputs allow you to pass various types of content—text, images, videos, audio files, and documents—to agents in a consistent, type-safe format. The MessageContent type provides a unified interface for encoding different media types, whether you’re executing agents or using chat completions.
Use text for plain text instructions, questions, or prompts:
Copy
from vlmrun.types import MessageContent# Simple text inputtext_content = MessageContent(type="text", text="Analyze this image and describe what you see")
Use input_file with a file ID for files uploaded via the Files API. This is the recommended approach for files you want to manage through VLM Run’s file storage:
Copy
from vlmrun.types import MessageContentfrom vlmrun.client import VLMRunfrom pathlib import Pathclient = VLMRun(base_url="https://agent.vlm.run/v1", api_key="<VLMRUN_API_KEY>")# Step 1: Upload the filefile_response = client.files.upload(file=Path("local_image.jpg"))# Step 2: Use the file ID in MessageContentfile_content = MessageContent( type="input_file", file_id=file_response.id)
When using input_file, you can provide either file_id (from Files API upload) or file_url (presigned URL or public URL). The SDK automatically handles file retrieval and processing.
When working with multi-modal inputs, follow these guidelines:
Use input_file for production: Upload files via the Files API and use file_id for better security, access control, and file management. URL-based inputs are convenient for development and testing.
Specify image detail levels: Use detail="high" for images requiring fine-grained analysis (e.g., medical imaging, document OCR). Use detail="low" for faster processing when high detail isn’t needed.
Validate URLs before use: Ensure all URLs are publicly accessible and use HTTPS when possible. The SDK validates URL format but cannot verify accessibility.
Use typed input models: Define Pydantic models for agent execution inputs to leverage type checking, IDE autocompletion, and automatic validation.
Handle large files appropriately: For large videos or documents, prefer uploading via the Files API rather than using public URLs, as the Files API provides better error handling and progress tracking.
Combine text with media: Always include text instructions alongside media inputs to provide context and specify the desired operation.
For chat completions, you can mix text and media in a single message’s content array. This allows you to provide both instructions and the media to process in one request.