Execute Agent - VLM Run

!pip install vlmrun

from pathlib import Path
from pydantic import BaseModel, Field
from vlmrun.client import VLMRun
from vlmrun.client.types import AgentExecutionResponse, AgentExecutionConfig
from vlmrun.types import MessageContent, FileUrl

# Define a Pydantic model for the execution inputs
class ExecutionInputs(BaseModel):
  file: MessageContent = Field(..., description="The file to extract data from")

# Define a Pydantic model for the response
class Invoice(BaseModel):
  invoice_id: str = Field(..., description="The ID of the invoice")
  total_amount: float = Field(..., description="The total amount of the invoice")

client = VLMRun(api_key="<VLMRUN_API_KEY>")

# Upload the file to the object store
file = client.files.upload(file=Path("test.pdf"))

# Execute the agent (by name and version)
response: AgentExecutionResponse = client.agent.execute(
  name="<agent-name>:<agent-version>",
  inputs=ExecutionInputs(
    file=MessageContent(type="file_url", file_url=FileUrl(url=file.public_url))
  ),
  batch=True,
)

# Execute the agent (by inline prompt)
response: AgentExecutionResponse = client.agent.execute(
  inputs=ExecutionInputs(
    file=MessageContent(type="file_url", file_url=FileUrl(url=file.public_url))
  ),
  config=AgentExecutionConfig(
    prompt="Extract the invoice_id and total amount from the invoice.",
    response_model=Invoice,
  ),
  batch=True,
)

{
  "name": "<string>",
  "usage": {
    "elements_processed": 123,
    "credits_used": 123,
    "steps": 123,
    "message": "<string>",
    "duration_seconds": 0
  },
  "id": "<string>",
  "response": "<unknown>",
  "status": "pending",
  "created_at": "2023-11-07T05:31:56Z",
  "completed_at": "2023-11-07T05:31:56Z"
}

POST /agent/execute

Request Inputs

The inputs field accepts a JSON object whose values are MessageContent items. Each value is a typed, discriminated union — the type field determines which modality is passed in as context for the agent. You can mix and match any number of modalities in a single request (e.g. a document + a reference image + a text instruction).

`type`	Payload field	Modality	When to use
`text`	`text`	Plain text	Instructions, questions, or prompt context
`image_url`	`image_url.url` (+ optional `detail`)	Image (URL)	Images hosted publicly (`jpg`, `png`, `webp`, …)
`video_url`	`video_url.url`	Video (URL)	Videos hosted publicly (`mp4`, `mov`, …)
`audio_url`	`audio_url.url`	Audio (URL)	Audio files hosted publicly (`mp3`, `wav`, …)
`file_url`	`file_url.url`	Document / file (URL)	PDFs, Word docs, or any other file accessible over HTTP(S)
`input_file`	`file_id`	Uploaded file	Files uploaded via `POST /v1/files` — pass the returned `file.id`

Each slot can also be a plain JSON primitive (string, number, boolean, array, object) when the agent’s input schema declares a non-media field — e.g. an email_body string or a structured metadata object to include alongside the uploaded file. See the Multi-modal Inputs guide for the full reference on each modality, including detail levels for images / video, uploaded-file workflows, and typed Pydantic / Zod input models.

inputs is just a dictionary of named context slots — the keys are arbitrary (e.g. "file", "document", "reference_image", "instruction", "email_details") and match the input schema of your agent. Each value is either a MessageContent object of one of the types above, or a plain JSON primitive.
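As a rough illustration of the table above, the sketch below builds each MessageContent shape as a plain Python dictionary and mixes the results with primitive values. The helper names (text_content, url_content, uploaded_file) are hypothetical and are not part of the vlmrun SDK; only the dictionary shapes come from this page.

```python
from typing import Any, Dict, Optional

# Hypothetical helpers (not part of the vlmrun SDK): each returns a plain dict
# shaped like one row of the MessageContent table above.

def text_content(value: str) -> Dict[str, Any]:
    """Plain-text instruction or prompt context."""
    return {"type": "text", "text": value}


def url_content(kind: str, url: str, detail: Optional[str] = None) -> Dict[str, Any]:
    """Build an image_url / video_url / audio_url / file_url item."""
    if kind not in {"image_url", "video_url", "audio_url", "file_url"}:
        raise ValueError(f"unsupported URL content type: {kind!r}")
    payload: Dict[str, Any] = {"url": url}
    if detail is not None:
        # The table lists `detail` as an optional field of image_url.
        payload["detail"] = detail
    return {"type": kind, kind: payload}


def uploaded_file(file_id: str) -> Dict[str, Any]:
    """Reference a file previously uploaded via POST /v1/files."""
    return {"type": "input_file", "file_id": file_id}


# Slot names are arbitrary; they should match the agent's input schema.
inputs = {
    "file": uploaded_file("file_abc123"),
    "reference_image": url_content("image_url", "https://example.com/layout.png", detail="high"),
    "instruction": text_content("Describe the product in the image."),
    "email_details": "<div>Please see the attached order form.</div>",  # plain primitive
    "metadata": {"source": "gmail"},                                     # plain JSON object
}
```

Keeping the builders as thin dictionary factories means the result can be serialized with json.dumps and sent as the inputs field without any SDK types.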

Generic payload — all input types

A single inputs object can freely mix every modality together with raw strings / JSON. The example below combines an uploaded file, a file URL, an image URL, a video URL, an audio URL, a text instruction, and two plain-primitive context fields (an HTML email body and a structured metadata object):

All input types

{
  "name": "<agent-name>:<agent-version>",
  "inputs": {
    "file": {
      "type": "input_file",
      "file_id": "dbb28d43-d741-4e0c-b25b-04ddc69b3197"
    },
    "supporting_document": {
      "type": "file_url",
      "file_url": { "url": "https://example.com/referral.pdf" }
    },
    "reference_image": {
      "type": "image_url",
      "image_url": { "url": "https://example.com/layout.png", "detail": "high" }
    },
    "demo_video": {
      "type": "video_url",
      "video_url": { "url": "https://example.com/clip.mp4" }
    },
    "voicemail": {
      "type": "audio_url",
      "audio_url": { "url": "https://example.com/voicemail.mp3" }
    },
    "instruction": {
      "type": "text",
      "text": "Schedule the patient and confirm insurance eligibility."
    },
    "email_details": "<div dir=\"ltr\">Hi,<br />Please see the attached order form for Oscar Bhujel. Kindly let us know once the appointment is scheduled.<br />Thank you,<br />Camielle Jane Lim</div>",
    "metadata": {
      "received_at": "2026-04-20T16:30:00Z",
      "priority": "normal",
      "source": "gmail"
    }
  },
  "batch": true
}

Minimal payload shapes

Document (PDF, Word, etc.) via URL

{
  "name": "<agent-name>:<agent-version>",
  "inputs": {
    "file": { "type": "file_url", "file_url": { "url": "https://example.com/invoice.pdf" } }
  }
}

Document via uploaded file ID

{
  "name": "<agent-name>:<agent-version>",
  "inputs": {
    "file": { "type": "input_file", "file_id": "file_abc123" }
  }
}

Image + text instruction

{
  "name": "<agent-name>:<agent-version>",
  "inputs": {
    "image": { "type": "image_url", "image_url": { "url": "https://example.com/photo.jpg", "detail": "high" } },
    "instruction": { "type": "text", "text": "Describe the product in the image." }
  }
}

Video + reference image

{
  "name": "<agent-name>:<agent-version>",
  "inputs": {
    "video": { "type": "video_url", "video_url": { "url": "https://example.com/clip.mp4" } },
    "reference": { "type": "image_url", "image_url": { "url": "https://example.com/style.jpg" } }
  }
}

Audio transcription

{
  "name": "<agent-name>:<agent-version>",
  "inputs": {
    "audio": { "type": "audio_url", "audio_url": { "url": "https://example.com/meeting.mp3" } }
  }
}

Uploaded file + raw string / JSON context

{
  "name": "<agent-name>:<agent-version>",
  "inputs": {
    "file": { "type": "input_file", "file_id": "dbb28d43-d741-4e0c-b25b-04ddc69b3197" },
    "email_details": "<div>Please see the attached order form. Let us know once scheduled.</div>",
    "metadata": { "received_at": "2026-04-20T16:30:00Z", "source": "gmail" }
  }
}
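Before sending a payload like the ones above, it can help to check locally that every typed slot carries the payload field its type requires. The following is a minimal sketch, assuming the type-to-field mapping from the Request Inputs table; validate_inputs is a hypothetical helper, and the server's own validation may be stricter or differ in detail.

```python
from typing import Any, Dict, List

# Mapping of `type` to the payload field it requires, taken from the table above.
PAYLOAD_FIELD = {
    "text": "text",
    "image_url": "image_url",
    "video_url": "video_url",
    "audio_url": "audio_url",
    "file_url": "file_url",
    "input_file": "file_id",
}


def validate_inputs(inputs: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issues found."""
    errors: List[str] = []
    for slot, value in inputs.items():
        # Heuristic: only dicts with a "type" key are treated as MessageContent.
        # Anything else is assumed to be a plain JSON primitive and is skipped.
        if not (isinstance(value, dict) and "type" in value):
            continue
        kind = value["type"]
        field = PAYLOAD_FIELD.get(kind)
        if field is None:
            errors.append(f"{slot}: unknown type {kind!r}")
        elif field not in value:
            errors.append(f"{slot}: type {kind!r} requires field {field!r}")
        elif kind.endswith("_url") and not (
            isinstance(value[field], dict) and "url" in value[field]
        ):
            errors.append(f"{slot}: {field!r} must be an object with a 'url' key")
    return errors


good = {
    "file": {"type": "input_file", "file_id": "file_abc123"},
    "metadata": {"received_at": "2026-04-20T16:30:00Z", "source": "gmail"},
}
bad = {"file": {"type": "input_file"}}

problems_good = validate_inputs(good)
problems_bad = validate_inputs(bad)
```

One limitation of this heuristic: a plain metadata object that happens to contain its own "type" key would be checked as if it were a MessageContent item.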

Set config.service_tier to control both billing and request routing — mirroring OpenAI’s service_tier and Vertex AI’s Gemini Flex/Priority offering:

standard (also accepted as default): baseline rates and latency. This tier applies when none is specified.
flex: 0.5× cost (50% off) with higher latency. Best for batch and background workloads.
priority: 1.8× cost with the lowest latency. Best for latency-sensitive, user-facing workflows.

Omitting the field (or passing "auto" or null) resolves to standard. See the pricing guide for full details.
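The resolution rule and cost multipliers above can be sketched locally as follows. This is an illustration only: the function names are hypothetical, the real resolution happens server-side, and it assumes the multipliers scale a baseline credit figure linearly and that "default" is an alias for "standard".

```python
from typing import Optional

# Cost multipliers relative to the standard tier, as stated above.
COST_MULTIPLIER = {"standard": 1.0, "flex": 0.5, "priority": 1.8}


def resolve_service_tier(tier: Optional[str]) -> str:
    """Map an omitted, null, "auto", or "default" tier to "standard"."""
    if tier is None or tier in ("auto", "default"):
        return "standard"
    if tier not in COST_MULTIPLIER:
        raise ValueError(f"unknown service tier: {tier!r}")
    return tier


def estimate_credits(base_credits: float, tier: Optional[str] = None) -> float:
    """Scale a baseline credit estimate by the tier's multiplier (assumed linear)."""
    return base_credits * COST_MULTIPLIER[resolve_service_tier(tier)]


flex_cost = estimate_credits(100, "flex")
priority_cost = estimate_credits(100, "priority")
default_cost = estimate_credits(100)
```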

Authorizations

Authorization

string

header

required

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.
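For callers not using the SDK, the header can be attached to a raw HTTP request. The sketch below uses only the Python standard library and builds the request without sending it. The host and the /v1 prefix in API_URL are assumptions (the prefix is borrowed from the POST /v1/files path mentioned on this page), and build_execute_request is a hypothetical helper; check the endpoint URL against your account's documentation before use.

```python
import json
import urllib.request

# Assumption: illustrative endpoint URL, not confirmed by this page.
API_URL = "https://api.vlm.run/v1/agent/execute"


def build_execute_request(token: str, body: dict, url: str = API_URL) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request with a Bearer token."""
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_execute_request(
    "<VLMRUN_API_KEY>",
    {
        "name": "<agent-name>:<agent-version>",
        "inputs": {
            "file": {"type": "file_url", "file_url": {"url": "https://example.com/invoice.pdf"}}
        },
        "batch": True,
    },
)

# To actually send it:
#   with urllib.request.urlopen(request) as resp:
#       execution = json.load(resp)
```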

Body

application/json

Request to execute an agent.

metadata

RequestMetadata · object

Optional metadata to pass to the model.

config

AgentExecutionConfig · object

The configuration for the agent execution request.

id

string

Unique identifier of the request.

created_at

string<date-time>

Date and time when the request was created (in UTC timezone)

callback_url

string<uri> | null

The URL to call when the request is completed.

Minimum string length: 1

model

enum<string> | null

VLM Run Agent model to use for execution. When omitted, the model declared in the skill's vlmrun.yaml is used if there is one; otherwise the agent's default model applies.

Available options:

vlmrun-orion-1,

vlmrun-orion-1:lite,

vlmrun-orion-1:auto,

vlmrun-orion-1:fast,

vlmrun-orion-1:pro,

vlmrun-orion-2.0,

vlmrun-orion-2.0:lite,

vlmrun-orion-2.0:auto,

vlmrun-orion-2.0:fast,

vlmrun-orion-2.0:pro

name

string | null

Name of the agent. If not provided, we use the prompt to identify the unique agent.

batch

boolean

default:true

Whether to process the document in batch mode (async).

inputs

AgentExecutionInputs · object

The inputs to the agent.

toolsets

enum<string>[] | null

List of tool categories to enable for this agent execution. Available categories: core, code-execution, document, image, image-gen, video, viz, web, world-gen. When specified, only tools from these categories are available. If omitted or null, only the core tools are enabled.

Available toolsets for agent tool selection.

Each toolset represents a category of related tools that can be enabled together for an agent execution.

Available options:

core,

code-execution,

document,

image,

image-gen,

video,

viz,

web,

world-gen

models

enum<string>[] | null

List of model-specific tool providers to enable for this execution. Available models: depth-anything-3, google-gemini-3-analysis, google-gemini-3-image, google-gemini-robotics-er, google-veo-3.1, meta-sam2, meta-sam3, meta-sam3d, microsoft-omniparser-v2, nvidia-cosmos-reason-2-8b, qwen-qwen3-vl-8b, vlm-dots-ocr. Multiple models can be selected — their tools are merged.

Available models for agent tool selection.

Each model represents a specialized capability backed by a specific model deployment. Multiple models can be selected simultaneously — pass a list and the tools are merged and deduplicated.

Usage in vlmrun.yaml:

model: vlmrun-orion-1:auto
toolsets:
  - core
  - image
models:
  - nvidia-cosmos-reason-2-8b
  - meta-sam3

Available options:

google-gemini-3-image,

google-gemini-3-analysis,

google-gemini-robotics-er,

google-veo-3.1,

microsoft-omniparser-v2,

qwen-qwen3-vl-8b,

meta-sam2,

meta-sam3,

meta-sam3d,

depth-anything-3,

vlm-dots-ocr,

nvidia-cosmos-reason-2-8b
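The toolsets and models fields can be checked against the option lists above before a request is sent. The sketch below is a hypothetical client-side helper (build_tool_selection is not an SDK function); it rejects unknown names and removes duplicates while preserving order, loosely mirroring the "merged and deduplicated" behaviour described for models.

```python
from typing import Dict, Iterable, List, Optional

# Option lists copied from this page.
TOOLSETS = {
    "core", "code-execution", "document", "image", "image-gen",
    "video", "viz", "web", "world-gen",
}
MODELS = {
    "google-gemini-3-image", "google-gemini-3-analysis", "google-gemini-robotics-er",
    "google-veo-3.1", "microsoft-omniparser-v2", "qwen-qwen3-vl-8b",
    "meta-sam2", "meta-sam3", "meta-sam3d", "depth-anything-3",
    "vlm-dots-ocr", "nvidia-cosmos-reason-2-8b",
}


def _checked(values: Iterable[str], allowed: set, label: str) -> List[str]:
    """Validate names against `allowed` and drop duplicates, keeping first-seen order."""
    seen: List[str] = []
    for value in values:
        if value not in allowed:
            raise ValueError(f"unknown {label}: {value!r}")
        if value not in seen:
            seen.append(value)
    return seen


def build_tool_selection(
    toolsets: Optional[Iterable[str]] = None,
    models: Optional[Iterable[str]] = None,
) -> Dict[str, List[str]]:
    """Return a request-body fragment; omitted fields are left out entirely."""
    body: Dict[str, List[str]] = {}
    if toolsets is not None:
        body["toolsets"] = _checked(toolsets, TOOLSETS, "toolset")
    if models is not None:
        body["models"] = _checked(models, MODELS, "model")
    return body


selection = build_tool_selection(
    toolsets=["core", "image", "core"],
    models=["nvidia-cosmos-reason-2-8b", "meta-sam3"],
)
```

The resulting fragment can be merged into the request body alongside name, inputs, and batch.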

Response

Successful Response

Response to the agent execution request.

name

string

required

Name of the agent

usage

CreditUsageResponse · object

The usage metrics for the request.

id

string

Unique identifier of the agent execution response.

response

any | null

The response from the model.

status

enum<string>

default:pending

The status of the job.

Available options:

pending,

enqueued,

running,

completed,

failed,

paused

created_at

string<date-time>

Date and time when the execution was created (in UTC timezone)

completed_at

string<date-time> | null

Date and time when the execution was completed (in UTC timezone)
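Because batch mode returns before the work is finished, a caller typically polls until the status settles. The sketch below shows one way to do that with a caller-supplied fetch function. wait_for_execution and fetch are hypothetical names, and treating only completed and failed as terminal (so paused keeps polling) is an assumption rather than documented behaviour.

```python
import time
from typing import Any, Callable, Dict

# Assumption: only these two statuses are final.
TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_execution(
    fetch: Callable[[], Dict[str, Any]],
    interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call `fetch` until the returned execution reaches a terminal status."""
    deadline = time.monotonic() + timeout
    while True:
        execution = fetch()
        if execution["status"] in TERMINAL_STATUSES:
            return execution
        if time.monotonic() >= deadline:
            raise TimeoutError(f"execution still {execution['status']!r} after {timeout}s")
        sleep(interval)


# Simulated sequence of responses standing in for repeated lookups by execution ID.
fake_responses = iter([
    {"id": "<string>", "status": "pending", "response": None},
    {"id": "<string>", "status": "running", "response": None},
    {"id": "<string>", "status": "completed", "response": {"invoice_id": "INV-1"}},
])

result = wait_for_execution(lambda: next(fake_responses), sleep=lambda _seconds: None)
```

In real use, fetch would wrap whatever call retrieves the execution by its id; passing a callback_url in the request is an alternative to polling.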
