The inputs field accepts a JSON object whose values are MessageContent items. Each value is a typed, discriminated union: the type field determines which modality is passed to the agent as context. You can mix any number of modalities in a single request (e.g. a document, a reference image, and a text instruction).
| type | Payload field | Modality | When to use |
|---|---|---|---|
| text | text | Plain text | Instructions, questions, or prompt context |
| image_url | image_url.url (+ optional detail) | Image (URL) | Images hosted publicly (jpg, png, webp, …) |
| video_url | video_url.url | Video (URL) | Videos hosted publicly (mp4, mov, …) |
| audio_url | audio_url.url | Audio (URL) | Audio files hosted publicly (mp3, wav, …) |
| file_url | file_url.url | Document / file (URL) | PDFs, Word docs, or any other file accessible over HTTP(S) |
| input_file | file_id | Uploaded file | Files uploaded via POST /v1/files (pass the returned file.id) |
Values can also be plain primitives (raw strings or JSON), for example an email_body string or a structured metadata object to include alongside the uploaded file.
See the Multi-modal Inputs guide for the full reference on each modality, including detail levels for images / video, uploaded-file workflows, and typed Pydantic / Zod input models.
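As a rough illustration of the discriminated union described in the table, the sketch below checks that a dict carries the payload field matching its type. The helper name `check_item` and the error messages are hypothetical; only the type names and payload fields come from the table, and this is not the library's own typed model.

```python
# Discriminator -> top-level payload field, copied from the table above.
PAYLOAD_FIELD = {
    "text": "text",
    "image_url": "image_url",
    "video_url": "video_url",
    "audio_url": "audio_url",
    "file_url": "file_url",
    "input_file": "file_id",
}

# Types whose payload is a nested object holding a `url` key.
URL_TYPES = {"image_url", "video_url", "audio_url", "file_url"}


def check_item(item: dict) -> str:
    """Return the payload field name for a MessageContent-style dict.

    Hypothetical helper: raises ValueError when the type is unknown or
    the matching payload field is missing.
    """
    kind = item.get("type")
    if kind not in PAYLOAD_FIELD:
        raise ValueError(f"unknown type: {kind!r}")
    field = PAYLOAD_FIELD[kind]
    if field not in item:
        raise ValueError(f"type {kind!r} requires field {field!r}")
    if kind in URL_TYPES:
        payload = item[field]
        if not isinstance(payload, dict) or "url" not in payload:
            raise ValueError(f"{field!r} must be an object with a 'url' key")
    return field
```

A check like this is only a local sanity test before sending a request; the service remains the authority on what it accepts.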
The inputs object can freely mix every modality together with raw strings and JSON values. For example, a single request can combine an uploaded file, a file URL, an image URL, a video URL, an audio URL, a text instruction, and two plain-primitive context fields (an HTML email body and a structured metadata object).
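A mixed inputs object of the kind described here might be assembled as follows. This is a minimal sketch: the key names other than email_body and metadata, the example.com URLs, the file ID placeholder, and the `detail` value are all assumptions made for illustration, and `url_item` is a hypothetical helper.

```python
import json


def url_item(kind: str, url: str, **extra) -> dict:
    """Build a URL-based item, e.g. {"type": "image_url", "image_url": {"url": ...}}."""
    return {"type": kind, kind: {"url": url, **extra}}


inputs = {
    # Uploaded file: file_id is the file.id returned by POST /v1/files (placeholder here).
    "document": {"type": "input_file", "file_id": "<file-id>"},
    "contract": url_item("file_url", "https://example.com/contract.pdf"),
    # "auto" is an assumed value for the optional detail field.
    "reference_image": url_item("image_url", "https://example.com/ref.png", detail="auto"),
    "walkthrough": url_item("video_url", "https://example.com/clip.mp4"),
    "voicemail": url_item("audio_url", "https://example.com/note.mp3"),
    "instruction": {"type": "text", "text": "Summarize the contract and flag mismatches."},
    # Plain-primitive context fields: a raw HTML string and a structured object.
    "email_body": "<p>Please review the attached contract.</p>",
    "metadata": {"customer_id": 123, "priority": "high"},
}

# Serialize to the JSON that would be placed in the request's inputs field.
print(json.dumps(inputs, indent=2))
```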
Bearer authentication header of the form Bearer <token>, where <token> is your auth token.
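The header format above can be built like this. The sketch assumes the standard HTTP Authorization header and a hypothetical environment variable name, VLMRUN_API_KEY; neither is confirmed by this page.

```python
import os


def auth_headers(token: str) -> dict:
    """Return request headers carrying a Bearer token."""
    if not token:
        raise ValueError("token must be a non-empty string")
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }


# VLMRUN_API_KEY is an assumed variable name; "<token>" is a placeholder fallback.
headers = auth_headers(os.environ.get("VLMRUN_API_KEY", "<token>"))
```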
Request to execute an agent.
Optional metadata to pass to the model.
The configuration for the agent execution request.
Unique identifier of the request.
Date and time when the request was created (in UTC timezone)
The URL to call when the request is completed.
VLM Run Agent model to use for execution.
Available options: vlmrun-orion-1, vlmrun-orion-1:lite, vlmrun-orion-1:auto, vlmrun-orion-1:fast, vlmrun-orion-1:pro, vlmrun-orion-1.5, vlmrun-orion-1.5:lite, vlmrun-orion-1.5:auto, vlmrun-orion-1.5:fast, vlmrun-orion-1.5:pro
Name of the agent. If not provided, we use the prompt to identify the unique agent.
Whether to process the document in batch mode (async).
The inputs to the agent.
List of tool categories to enable for this agent execution. Available categories: core, image, image-gen, world-gen, viz, document, video, web. When specified, only tools from these categories will be available. If None, defaults to 'core' tools only.
Available toolsets for agent tool selection.
Each toolset represents a category of related tools that can be enabled together for an agent execution.
Available options: core, document, image, image-gen, video, viz, web, world-gen
List of model-specific tool providers to enable for this execution. Available models: depth-anything-3, google-gemini-3-analysis, google-gemini-3-image, google-gemini-robotics-er, google-veo-3.1, meta-sam2, meta-sam3, meta-sam3d, microsoft-omniparser-v2, nvidia-cosmos-reason-2-8b, qwen-qwen3-vl-8b, vlm-dots-ocr. Multiple models can be selected; their tools are merged.
Available models for agent tool selection.
Each model represents a specialized capability backed by a specific model deployment. Multiple models can be selected simultaneously — pass a list and the tools are merged and deduplicated.
Usage in vlmrun.yaml::

    model: vlmrun-orion-1:auto
    toolsets:
      - core
      - image
    models:
      - nvidia-cosmos-reason-2-8b
      - meta-sam3
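The same selection can be expressed as a Python dict and checked locally against the listed toolsets and models. This is a sketch only: `build_config` is a hypothetical helper, and using the YAML keys as request fields at this level is an assumption. Local deduplication here simply mirrors the statement that merged tools are deduplicated.

```python
TOOLSETS = {"core", "document", "image", "image-gen", "video", "viz", "web", "world-gen"}
MODELS = {
    "depth-anything-3", "google-gemini-3-analysis", "google-gemini-3-image",
    "google-gemini-robotics-er", "google-veo-3.1", "meta-sam2", "meta-sam3",
    "meta-sam3d", "microsoft-omniparser-v2", "nvidia-cosmos-reason-2-8b",
    "qwen-qwen3-vl-8b", "vlm-dots-ocr",
}


def _validated(values, allowed, label):
    """Drop duplicates (keeping first-seen order) and reject unknown names."""
    unique = list(dict.fromkeys(values))
    unknown = [v for v in unique if v not in allowed]
    if unknown:
        raise ValueError(f"unknown {label}: {unknown}")
    return unique


def build_config(model: str, toolsets=None, models=None) -> dict:
    """Hypothetical helper mirroring the vlmrun.yaml keys shown above."""
    config = {"model": model}
    if toolsets is not None:  # when omitted, the docs say 'core' tools are used
        config["toolsets"] = _validated(toolsets, TOOLSETS, "toolsets")
    if models is not None:
        config["models"] = _validated(models, MODELS, "models")
    return config


config = build_config(
    "vlmrun-orion-1:auto",
    toolsets=["core", "image"],
    models=["nvidia-cosmos-reason-2-8b", "meta-sam3"],
)
```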
Available options: google-gemini-3-image, google-gemini-3-analysis, google-gemini-robotics-er, google-veo-3.1, microsoft-omniparser-v2, qwen-qwen3-vl-8b, meta-sam2, meta-sam3, meta-sam3d, depth-anything-3, vlm-dots-ocr, nvidia-cosmos-reason-2-8b
Successful Response
Response to the agent execution request.
Name of the agent
The usage metrics for the request.
Unique identifier of the agent execution response.
The response from the model.
The status of the job.
Available options: pending, enqueued, running, completed, failed, paused
Date and time when the execution was created (in UTC timezone).
Date and time when the execution was completed (in UTC timezone).
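When reading a response, a client typically needs to know whether the job has finished and how long it took. The sketch below assumes that completed and failed are the only terminal statuses and that timestamps arrive as ISO 8601 strings with a UTC offset; neither is stated on this page, and both helper names are hypothetical.

```python
from datetime import datetime

STATUSES = ("pending", "enqueued", "running", "completed", "failed", "paused")
# Assumption: only these two statuses mean the job will not change further.
TERMINAL_STATUSES = {"completed", "failed"}


def is_terminal(status: str) -> bool:
    """Return True when a job in this status is assumed to be finished."""
    if status not in STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    return status in TERMINAL_STATUSES


def elapsed_seconds(created: str, completed: str) -> float:
    """Seconds between two ISO 8601 timestamps (format is an assumption)."""
    start = datetime.fromisoformat(created)
    end = datetime.fromisoformat(completed)
    return (end - start).total_seconds()
```

A polling loop would call `is_terminal` on each fetched status and stop once it returns True.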