
Overview of AI agent capabilities with VLM Run Agents.
vlm-agent-1 introduces a new paradigm for agentic visual reasoning and execution. Unlike monolithic VLMs, vlm-agent-1 orchestrates specialized computer vision tools – OCR, detection, segmentation, keypoint localization, diffusion, and geometric analysis – to execute complex multi-step visual workflows from natural language instructions. This marks the transition from passive visual understanding to autonomous, tool-augmented visual intelligence that bridges neural perception with symbolic execution.
Agents Supported
The latest generation of VLM Run agents is available in two flavors: vlm-agent-1:fast and vlm-agent-1:pro.
vlm-agent-1:fast
Our fast visual agent for simple multi-modal workflows. Optimized for speed and quick responses.
vlm-agent-1:pro
Our most capable visual agent for complex, multi-step workflows. Handles long tool-trajectories and advanced reasoning.
Looking to chat with vlm-agent-1? Visit chat.vlm.run.
What makes VLM Run Agents unique?
Here are some key features of VLM Run Agents that set them apart from other AI agent platforms:
Multi-Modal, Multi-Turn Reasoning
Execute complex multi-step visual workflows with adaptive context management across extended conversations.
First-class Visual AI Tools
Comprehensive suite of specialized tools across document, image, video, and multimodal processing—composable into multi-stage pipelines.
OpenAI-Compatible API
Use our OpenAI Chat Completions endpoint to interact with vlm-agent-1 with just a two-line code change.
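A minimal sketch of that two-line change using the official `openai` Python client: only the `base_url` and `api_key` differ from a stock OpenAI call. The base URL shown here and the `VLMRUN_API_KEY` environment variable are assumptions for illustration; check your VLM Run dashboard for the exact values.

```python
import os

from openai import OpenAI

# The only two lines that change from a standard OpenAI setup:
client = OpenAI(
    base_url="https://api.vlm.run/v1/openai",   # assumed OpenAI-compatible endpoint
    api_key=os.environ["VLMRUN_API_KEY"],       # assumed env var for your VLM Run key
)

response = client.chat.completions.create(
    model="vlm-agent-1:pro",  # or "vlm-agent-1:fast" for simpler, latency-sensitive workflows
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is the total on this invoice?"},
                # Illustrative image URL; any fetchable image works here.
                {"type": "image_url", "image_url": {"url": "https://example.com/invoice.png"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```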
Enterprise-Ready
Our agents are SOC 2 Type 2 and HIPAA compliant, production-ready with automatic validation, and support full traceability and auditability.
How is vlm-agent-1 different from frontier models?
Unlike monolithic vision-language models (VLMs) such as GPT-5, Claude 4.5, and Gemini 2.5, vlm-agent-1 delivers comprehensive capabilities across all modalities and tasks. The table below highlights the key differences that matter for building production-grade visual workflows:
| Task | Capability | vlm-agent-1 | GPT-5 | Gemini 2.5 | Claude Sonnet 4.5 | Qwen3-VL 235B-A22B |
|---|---|---|---|---|---|---|
| Image / Video | Understanding | ✓ | ⚠ | ✓ | ⚠ | ✓ |
| | Reasoning | ✓ | ✗ | ✗ | ✗ | ✓ |
| | Structured Outputs | ✓ | ✓ | ✓ | ✓ | ✓ |
| | Multi-modal Tool-Calling | ✓ | ✗ | ✗ | ✗ | ⚠ |
| | Specialized Skills | ✓ | ✗ | ⚠ | ⚠ | ✗ |
| Document | Understanding | ✓ | ✓ | ✓ | ✓ | ✓ |
| | Reasoning | ✓ | ✓ | ✓ | ✓ | ✗ |
| | Structured Outputs | ✓ | ✓ | ✓ | ✓ | ✓ |
| | Multi-modal Tool-Calling | ✓ | ⚠ | ⚠ | ⚠ | ✗ |
| | Specialized Skills | ✓ | ✓ | ⚠ | ✓ | ✗ |
In the table above, Specialized Skills refers to tasks such as object localization, segmentation, image generation and editing, or geometric analysis that are typically found in specialized computer vision applications.
- Mixed-modality Reasoning: Only vlm-agent-1 provides full reasoning across images, documents, and video, which is critical for building multi-step visual workflows.
- Multi-modal Tool-Calling: With unique tool-calling support for images, videos, and documents, vlm-agent-1 enables multi-modal reasoning and execution that other models cannot perform.
- Production-Ready Structured Outputs: Consistent structured output support across all modalities, with automatic validation and retry logic (see the sketch after this list).
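A hedged sketch of requesting structured outputs, assuming the endpoint honors the OpenAI `json_schema` response format; the schema fields and image URL are illustrative, not part of the documented API.

```python
import json
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.vlm.run/v1/openai",   # assumed endpoint, as above
    api_key=os.environ["VLMRUN_API_KEY"],
)

# Illustrative schema: constrain the agent's answer to a typed invoice record.
schema = {
    "name": "invoice",
    "schema": {
        "type": "object",
        "properties": {
            "vendor": {"type": "string"},
            "total": {"type": "number"},
            "currency": {"type": "string"},
        },
        "required": ["vendor", "total", "currency"],
    },
}

response = client.chat.completions.create(
    model="vlm-agent-1:pro",
    response_format={"type": "json_schema", "json_schema": schema},
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract the vendor, total, and currency."},
                {"type": "image_url", "image_url": {"url": "https://example.com/invoice.png"}},
            ],
        }
    ],
)

# The message content is guaranteed to parse against the schema above.
print(json.loads(response.choices[0].message.content))
```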