POST /v1/agent/execute
!pip install vlmrun

from pathlib import Path
from pydantic import BaseModel, Field
from vlmrun.client import VLMRun
from vlmrun.client.types import AgentExecutionResponse, AgentExecutionConfig
from vlmrun.types import MessageContent, FileUrl

# Define a Pydantic model for the execution inputs
class ExecutionInputs(BaseModel):
  file: MessageContent = Field(..., description="The file to extract data from")

# Define a Pydantic model for the response
class Invoice(BaseModel):
  invoice_id: str = Field(..., description="The ID of the invoice")
  total_amount: float = Field(..., description="The total amount of the invoice")

client = VLMRun(api_key="<VLMRUN_API_KEY>")

# Upload the file to the object store
file = client.files.upload(file=Path("test.pdf"))

# Execute the agent (by name and version)
response: AgentExecutionResponse = client.agent.execute(
  name="<agent-name>:<agent-version>",
  inputs=ExecutionInputs(
    file=MessageContent(type="file_url", file_url=FileUrl(url=file.public_url))
  ),
  batch=True,
)

# Execute the agent (by inline prompt)
response: AgentExecutionResponse = client.agent.execute(
  inputs=ExecutionInputs(
    file=MessageContent(type="file_url", file_url=FileUrl(url=file.public_url))
  ),
  config=AgentExecutionConfig(
    prompt="Extract the invoice_id and total amount from the invoice.",
    response_model=Invoice,
  ),
  batch=True,
)
{
  "name": "<string>",
  "usage": {
    "elements_processed": 123,
    "element_type": "image",
    "credits_used": 123,
    "steps": 123,
    "message": "<string>",
    "duration_seconds": 0
  },
  "id": "<string>",
  "response": "<unknown>",
  "status": "pending",
  "created_at": "2023-11-07T05:31:56Z",
  "completed_at": "2023-11-07T05:31:56Z"
}


Request Inputs

The inputs field accepts a JSON object whose values are MessageContent items. Each value is a typed, discriminated union — the type field determines which modality is passed in as context for the agent. You can mix and match any number of modalities in a single request (e.g. a document + a reference image + a text instruction).
| `type` | Payload field | Modality | When to use |
|---|---|---|---|
| `text` | `text` | Plain text | Instructions, questions, or prompt context |
| `image_url` | `image_url.url` (+ optional `detail`) | Image (URL) | Images hosted publicly (jpg, png, webp, …) |
| `video_url` | `video_url.url` | Video (URL) | Videos hosted publicly (mp4, mov, …) |
| `audio_url` | `audio_url.url` | Audio (URL) | Audio files hosted publicly (mp3, wav, …) |
| `file_url` | `file_url.url` | Document / file (URL) | PDFs, Word docs, or any other file accessible over HTTP(S) |
| `input_file` | `file_id` | Uploaded file | Files uploaded via POST /v1/files — pass the returned `file.id` |
Each slot can also be a plain JSON primitive (string, number, boolean, array, object) when the agent’s input schema declares a non-media field — e.g. an email_body string or a structured metadata object to include alongside the uploaded file. See the Multi-modal Inputs guide for the full reference on each modality, including detail levels for images / video, uploaded-file workflows, and typed Pydantic / Zod input models.
inputs is a dictionary of named context slots. The keys (e.g. "file", "document", "reference_image", "instruction", "email_details") are defined by your agent's input schema, and each value is either a MessageContent object of one of the types above or a plain JSON value.
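Concretely, a mixed-modality inputs payload is just a JSON object. The sketch below builds one in plain Python; the slot names ("file", "reference_image", "instruction", "email_details") and the file ID are illustrative and must match the input schema your agent actually declares:

```python
# Build a mixed-modality `inputs` payload as a plain dict. Slot names and the
# file_id below are illustrative, not fixed by the API.
inputs = {
    # Uploaded file, referenced by the id returned from POST /v1/files
    "file": {"type": "input_file", "file_id": "file_abc123"},
    # Publicly hosted image, with an optional detail level
    "reference_image": {
        "type": "image_url",
        "image_url": {"url": "https://example.com/layout.png", "detail": "high"},
    },
    # Plain-text instruction slot
    "instruction": {"type": "text", "text": "Extract the line items."},
    # Non-media slot: a plain JSON string is passed through as-is
    "email_details": "Please see the attached order form.",
}
```

Each value with a `type` key is a MessageContent discriminated union; `email_details` shows a plain-primitive slot passed through untyped.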

Generic payload — all input types

A single inputs object can freely mix every modality together with raw strings / JSON. The example below combines an uploaded file, a file URL, an image URL, a video URL, an audio URL, a text instruction, and two plain-primitive context fields (an HTML email body and a structured metadata object):
{
  "name": "<agent-name>:<agent-version>",
  "inputs": {
    "file": {
      "type": "input_file",
      "file_id": "dbb28d43-d741-4e0c-b25b-04ddc69b3197"
    },
    "supporting_document": {
      "type": "file_url",
      "file_url": { "url": "https://example.com/referral.pdf" }
    },
    "reference_image": {
      "type": "image_url",
      "image_url": { "url": "https://example.com/layout.png", "detail": "high" }
    },
    "demo_video": {
      "type": "video_url",
      "video_url": { "url": "https://example.com/clip.mp4" }
    },
    "voicemail": {
      "type": "audio_url",
      "audio_url": { "url": "https://example.com/voicemail.mp3" }
    },
    "instruction": {
      "type": "text",
      "text": "Schedule the patient and confirm insurance eligibility."
    },
    "email_details": "<div dir=\"ltr\">Hi,<br />Please see the attached order form for Oscar Bhujel. Kindly let us know once the appointment is scheduled.<br />Thank you,<br />Camielle Jane Lim</div>",
    "metadata": {
      "received_at": "2026-04-20T16:30:00Z",
      "priority": "normal",
      "source": "gmail"
    }
  },
  "batch": true
}

Minimal payload shapes

Document (PDF, Word, etc.) via URL
{
  "name": "<agent-name>:<agent-version>",
  "inputs": {
    "file": { "type": "file_url", "file_url": { "url": "https://example.com/invoice.pdf" } }
  }
}
Document via uploaded file ID
{
  "name": "<agent-name>:<agent-version>",
  "inputs": {
    "file": { "type": "input_file", "file_id": "file_abc123" }
  }
}
Image + text instruction
{
  "name": "<agent-name>:<agent-version>",
  "inputs": {
    "image": { "type": "image_url", "image_url": { "url": "https://example.com/photo.jpg", "detail": "high" } },
    "instruction": { "type": "text", "text": "Describe the product in the image." }
  }
}
Video + reference image
{
  "name": "<agent-name>:<agent-version>",
  "inputs": {
    "video": { "type": "video_url", "video_url": { "url": "https://example.com/clip.mp4" } },
    "reference": { "type": "image_url", "image_url": { "url": "https://example.com/style.jpg" } }
  }
}
Audio transcription
{
  "name": "<agent-name>:<agent-version>",
  "inputs": {
    "audio": { "type": "audio_url", "audio_url": { "url": "https://example.com/meeting.mp3" } }
  }
}
Uploaded file + raw string / JSON context
{
  "name": "<agent-name>:<agent-version>",
  "inputs": {
    "file": { "type": "input_file", "file_id": "dbb28d43-d741-4e0c-b25b-04ddc69b3197" },
    "email_details": "<div>Please see the attached order form. Let us know once scheduled.</div>",
    "metadata": { "received_at": "2026-04-20T16:30:00Z", "source": "gmail" }
  }
}
Every /agent/execute request can also set service_tier at the top level of the body to control both pricing and request routing:
  • standard (default) — 1.0× baseline rates and latency.
  • flex — 0.5× cost (50% off), higher latency. Best for bulk / backfill jobs.
  • priority — 1.8× cost, lowest latency. Best for interactive workflows.
See the pricing guide for full details.
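For example, a request routed to the discounted tier adds `service_tier` alongside the other top-level fields (illustrative payload):

```json
{
  "name": "<agent-name>:<agent-version>",
  "inputs": {
    "file": { "type": "file_url", "file_url": { "url": "https://example.com/invoice.pdf" } }
  },
  "batch": true,
  "service_tier": "flex"
}
```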

Authorizations

Authorization (string, header, required): Bearer authentication header of the form Bearer <token>, where <token> is your auth token.
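If you are calling the endpoint without the SDK, the header is set like any other bearer token. The sketch below constructs (but does not send) the request with the standard library; the base URL shown is an assumption for illustration, so confirm it against your dashboard before use:

```python
import json
import urllib.request

API_KEY = "sk-example"  # placeholder: substitute your real VLM Run API key
# Assumed endpoint URL for illustration; verify the base URL in your docs/dashboard.
url = "https://api.vlm.run/v1/agent/execute"

payload = {
    "name": "<agent-name>:<agent-version>",
    "inputs": {
        "file": {"type": "file_url", "file_url": {"url": "https://example.com/invoice.pdf"}}
    },
}
req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {API_KEY}",  # the Bearer <token> form required above
        "Content-Type": "application/json",
    },
    method="POST",
)
# urllib.request.urlopen(req)  # uncomment to actually send the request
```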

Body

application/json

Request to execute an agent.

metadata (RequestMetadata · object): Optional metadata to pass to the model.

config (AgentExecutionConfig · object): The configuration for the agent execution request.

id (string): Unique identifier of the request.

created_at (string<date-time>): Date and time when the request was created (UTC).

callback_url (string<uri> | null): The URL to call when the request is completed. Minimum string length: 1.

model (enum<string>, default: vlmrun-orion-1:auto): VLM Run Agent model to use for execution. Available options: vlmrun-orion-1, vlmrun-orion-1:lite, vlmrun-orion-1:auto, vlmrun-orion-1:fast, vlmrun-orion-1:pro, vlmrun-orion-1.5, vlmrun-orion-1.5:lite, vlmrun-orion-1.5:auto, vlmrun-orion-1.5:fast, vlmrun-orion-1.5:pro.

name (string | null): Name of the agent. If not provided, the prompt is used to identify the unique agent.

batch (boolean, default: true): Whether to process the document in batch mode (async).

inputs (AgentExecutionInputs · object): The inputs to the agent.

toolsets (enum<string>[] | null): List of tool categories to enable for this agent execution. Each toolset represents a category of related tools that can be enabled together. When specified, only tools from these categories will be available; if null, defaults to core tools only. Available options: core, document, image, image-gen, video, viz, web, world-gen.
models (enum<string>[] | null): List of model-specific tool providers to enable for this execution. Each model represents a specialized capability backed by a specific model deployment. Multiple models can be selected simultaneously — pass a list and their tools are merged and deduplicated. Available options: depth-anything-3, google-gemini-3-analysis, google-gemini-3-image, google-gemini-robotics-er, google-veo-3.1, meta-sam2, meta-sam3, meta-sam3d, microsoft-omniparser-v2, nvidia-cosmos-reason-2-8b, qwen-qwen3-vl-8b, vlm-dots-ocr.

Usage in vlmrun.yaml:

model: vlmrun-orion-1:auto
toolsets:
  - core
  - image
models:
  - nvidia-cosmos-reason-2-8b
  - meta-sam3
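Putting these together, a request body that enables extra toolsets and a model-specific tool provider might look like this (illustrative payload, using only the fields documented above):

```json
{
  "name": "<agent-name>:<agent-version>",
  "inputs": {
    "file": { "type": "file_url", "file_url": { "url": "https://example.com/report.pdf" } }
  },
  "toolsets": ["core", "document"],
  "models": ["meta-sam3"]
}
```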

Response

Successful Response

Response to the agent execution request.

name (string, required): Name of the agent.

usage (CreditUsageResponse · object): The usage metrics for the request.

id (string): Unique identifier of the agent execution response.

response (any | null): The response from the model.

status (enum<string>, default: pending): The status of the job. Available options: pending, enqueued, running, completed, failed, paused.

created_at (string<date-time>): Date and time when the execution was created (UTC).

completed_at (string<date-time> | null): Date and time when the execution was completed (UTC).
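Because batch executions are asynchronous, clients typically poll until the job reaches a terminal status. A minimal helper over the documented status values (the function name and the grouping of "paused" as non-terminal are illustrative assumptions, not part of the SDK):

```python
# Status values documented above for AgentExecutionResponse.status.
# Assumption: "paused" is treated as non-terminal here, since it is listed
# alongside in-flight states; adjust if your workflow treats it differently.
TERMINAL_STATUSES = {"completed", "failed"}
IN_PROGRESS_STATUSES = {"pending", "enqueued", "running", "paused"}

def is_terminal(status: str) -> bool:
    """Return True when the execution has finished, successfully or not."""
    return status in TERMINAL_STATUSES
```

A polling loop would fetch the execution by id, check `is_terminal(response.status)`, and sleep between attempts until it returns True.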