
Overview of AI agent capabilities with VLM Run Agents.

Today’s frontier Vision-Language Models like GPT-5, Claude 4.5, and Gemini 2.5 Pro can describe images and answer questions, but they operate as monolithic inference engines. They generate descriptive outputs but cannot act on visual data with the precision, determinism, or compositional control required for production-grade workflows. vlm-agent-1 introduces a new paradigm for agentic visual reasoning and execution. Unlike monolithic VLMs, vlm-agent-1 orchestrates specialized computer vision tools – OCR, detection, segmentation, keypoint localization, diffusion, and geometric analysis – to execute complex multi-step visual workflows from natural language instructions. This marks the transition from passive visual understanding to autonomous, tool-augmented visual intelligence that bridges neural perception with symbolic execution.

Agents Supported

The latest generation of VLM Run agents is available in two flavors: vlm-agent-1:fast and vlm-agent-1:pro.

vlm-agent-1:fast

Our fast visual agent for simple multi-modal workflows. Optimized for speed and quick responses.

vlm-agent-1:pro

Our most capable visual agent for complex, multi-step workflows. Handles long tool-trajectories and advanced reasoning.
Looking to chat with vlm-agent-1? Visit chat.vlm.run.

What makes VLM Run Agents unique?

Here are some key features of VLM Run Agents that set them apart from other AI agent platforms:

Multi-Modal, Multi-Turn Reasoning

Execute complex multi-step visual workflows with adaptive context management across extended conversations.

First-class Visual AI Tools

Comprehensive suite of specialized tools across document, image, video, and multimodal processing—composable into multi-stage pipelines.

OpenAI-Compatible API

Use our OpenAI Chat Completions endpoint to interact with vlm-agent-1 with just a two-line code change, as sketched below.
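
If you already use the OpenAI Python SDK, switching to vlm-agent-1 typically means changing only the client's base URL and the model name. The sketch below assumes a placeholder base URL and a hypothetical `VLMRUN_API_KEY` environment variable; check the API reference for the exact values.

```python
import os
from openai import OpenAI

# Point the standard OpenAI client at the VLM Run endpoint.
# NOTE: the base URL below is an assumption; see the API reference for the exact value.
client = OpenAI(
    base_url="https://api.vlm.run/v1/openai",
    api_key=os.environ["VLMRUN_API_KEY"],  # hypothetical env var holding your VLM Run API key
)

# Send a natural-language instruction together with an image.
response = client.chat.completions.create(
    model="vlm-agent-1:fast",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract the invoice number and total from this document."},
                {"type": "image_url", "image_url": {"url": "https://example.com/invoice.png"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```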

Enterprise-Ready

Our agents are SOC 2 Type 2 and HIPAA compliant, production-ready with automatic validation, and support full traceability and auditability.

How is vlm-agent-1 different from frontier models?

Unlike monolithic Vision-Language Models such as GPT-5, Claude 4.5, and Gemini 2.5, vlm-agent-1 delivers comprehensive capabilities across all modalities and tasks. The table below highlights key differences that matter for building production-grade visual workflows:
| Task | vlm-agent-1 | GPT-5 | Gemini 2.5 | Claude Sonnet 4.5 | Qwen3-VL 235B-A22B |
| --- | --- | --- | --- | --- | --- |
| Image / Video: Understanding | | | | | |
| Image / Video: Reasoning | | | | | |
| Image / Video: Structured Outputs | | | | | |
| Image / Video: Multi-modal Tool-Calling | | | | | |
| Image / Video: Specialized Skills | | | | | |
| Document: Understanding | | | | | |
| Document: Reasoning | | | | | |
| Document: Structured Outputs | | | | | |
| Document: Multi-modal Tool-Calling | | | | | |
| Document: Specialized Skills | | | | | |
In the table above, Specialized Skills refers to tasks such as object localization, segmentation, image generation and editing, or the geometric tools typically found in specialized computer vision applications.
Key advantages for developers:
  • Mixed-modality Reasoning: Only vlm-agent-1 provides full reasoning across images, documents, and video, which is critical for building multi-step visual workflows.
  • Multi-modal Tool-Calling: With unique tool-calling support for images, videos, and documents, vlm-agent-1 enables multi-modal reasoning and execution that other models cannot perform.
  • Production-Ready Structured Outputs: Consistent structured output support across all modalities with automatic validation and retry logic (see the sketch after this list).
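
As a rough sketch of how structured outputs could be requested through the OpenAI-compatible endpoint, the example below constrains the response to a small JSON schema via the Chat Completions `response_format` parameter. The base URL, the `VLMRUN_API_KEY` environment variable, and support for this exact parameter are assumptions drawn from the standard OpenAI API rather than confirmed vlm-agent-1 behavior; consult the API reference for exact usage.

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.vlm.run/v1/openai",  # assumed base URL; see the API reference
    api_key=os.environ["VLMRUN_API_KEY"],      # hypothetical env var for your API key
)

# A small JSON schema describing the fields we want extracted from the document.
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},
        "currency": {"type": "string"},
    },
    "required": ["invoice_number", "total_amount", "currency"],
    "additionalProperties": False,
}

response = client.chat.completions.create(
    model="vlm-agent-1:pro",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract the invoice fields defined in the schema."},
                {"type": "image_url", "image_url": {"url": "https://example.com/invoice.png"}},
            ],
        }
    ],
    # Ask for schema-constrained JSON; assumes OpenAI-style json_schema support.
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "invoice", "schema": invoice_schema, "strict": True},
    },
)
print(response.choices[0].message.content)  # JSON string matching the schema
```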

Let’s get started!

Below you’ll find the API reference and code samples so you can start building intelligent agents for your use case. Sign up for an API key on our platform, then check out some of our cookbooks to learn how to use VLM Run Agents to build sophisticated visual AI workflows.