`vlmrun.yaml` defines how a skill executes within VLM Run's Orion agent. It specifies the model, the available toolsets, and a state machine graph that orchestrates the agent's workflow.

Format

```yaml
apiVersion: vlm.run/v1alpha
metadata: {}
model: vlmrun-orion-1:auto
toolsets:
  - core
  - image
graph: |
  stateDiagram-v2
    AnalyzeImage: Analyze input image content
    Output: Produce structured output conforming to the target schema

    [*] --> AnalyzeImage
    AnalyzeImage --> Output
    Output --> [*]
```

Fields

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `apiVersion` | string | Yes | API version; currently `vlm.run/v1alpha` |
| `metadata` | object | No | Reserved for future use (pass `{}`) |
| `model` | string | Yes | Orion model variant to use |
| `toolsets` | string[] | Yes | Tool categories available to the agent |
| `graph` | string | Yes | State machine definition in Mermaid `stateDiagram-v2` format |
| `plan` | string | No | Human-readable explanation of the state machine |
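The required/optional split above can be checked mechanically. The sketch below is illustrative and not part of any VLM Run SDK; it validates an already-parsed document (a real loader would first parse the YAML, e.g. with PyYAML's `yaml.safe_load`), and `validate_skill` is a hypothetical helper name:

```python
# Hypothetical validator for the field table above. Works on a plain dict,
# as produced by a YAML parser; the field names and types come from the docs.
REQUIRED = {"apiVersion": str, "model": str, "toolsets": list, "graph": str}
OPTIONAL = {"metadata": dict, "plan": str}

def validate_skill(doc):
    """Return a list of problems; an empty list means the document is valid."""
    errors = []
    for field, expected in REQUIRED.items():
        if field not in doc:
            errors.append(f"missing required field: {field}")
        elif not isinstance(doc[field], expected):
            errors.append(f"{field}: expected {expected.__name__}")
    for field, expected in OPTIONAL.items():
        if field in doc and not isinstance(doc[field], expected):
            errors.append(f"{field}: expected {expected.__name__}")
    return errors

skill = {
    "apiVersion": "vlm.run/v1alpha",
    "model": "vlmrun-orion-1:auto",
    "toolsets": ["core", "image"],
    "graph": "stateDiagram-v2\n  [*] --> Output\n  Output --> [*]",
}
print(validate_skill(skill))  # []
```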

Model Variants

| Model | Description |
| --- | --- |
| `vlmrun-orion-1:fast` | Optimized for speed |
| `vlmrun-orion-1:auto` | Balanced speed and quality (recommended) |
| `vlmrun-orion-1:pro` | Maximum quality |

Toolsets

| Toolset | Description |
| --- | --- |
| `core` | Basic operations (file I/O, text processing) |
| `image` | Image analysis and understanding |
| `image-gen` | Image generation and editing |
| `video` | Video analysis and understanding |
| `document` | Document extraction and layout understanding |

State Machine Graph

The `graph` field uses Mermaid `stateDiagram-v2` syntax to define the agent's execution flow. Each state represents a step the agent performs, and transitions define the order in which steps run.

Syntax

```text
stateDiagram-v2
  StateName: Description of what the agent does in this state

  [*] --> FirstState         # Entry point
  FirstState --> NextState   # Transition
  FinalState --> [*]         # Exit point
```
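Transitions in this syntax are regular enough to extract with a short regular expression. A minimal sketch, not part of any VLM Run tooling; it handles only the simple linear graphs shown in these docs, not the full Mermaid grammar:

```python
import re

def parse_transitions(graph):
    """Extract (source, target) pairs from stateDiagram-v2 text.

    Illustrative only: matches lines of the form `A --> B`, where either
    side may be the entry/exit marker [*].
    """
    pattern = re.compile(r"^\s*(\[\*\]|\w+)\s*-->\s*(\[\*\]|\w+)", re.MULTILINE)
    return pattern.findall(graph)

graph = """
stateDiagram-v2
  AnalyzeImage: Analyze input image content
  Output: Produce structured output

  [*] --> AnalyzeImage
  AnalyzeImage --> Output
  Output --> [*]
"""
print(parse_transitions(graph))
# [('[*]', 'AnalyzeImage'), ('AnalyzeImage', 'Output'), ('Output', '[*]')]
```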

Simple Example (2 states)

A basic image analysis skill with analyze → output flow:
```yaml
graph: |
  stateDiagram-v2
    AnalyzeImage: Analyze and process image content using multi-modal vision capabilities to identify objects, extract visual features, detect text, and understand spatial relationships within the scene
    Output: Aggregate and reason over all analysis results, validate extracted information for consistency, and produce a well-structured output conforming to the target schema

    [*] --> AnalyzeImage
    AnalyzeImage --> Output
    Output --> [*]
```

Multi-Step Example (3+ states)

A video analysis skill with multiple processing stages:
```yaml
graph: |
  stateDiagram-v2
    AnalyzeVideo: Watch the full video to understand the assembly workflow and identify key segments
    ExtractInteractions: For each identified segment, extract detailed interaction data including timestamps and classifications
    Output: Aggregate all extracted interactions, validate for consistency, and produce structured output

    [*] --> AnalyzeVideo
    AnalyzeVideo --> ExtractInteractions
    ExtractInteractions --> Output
    Output --> [*]
```
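Because every state in these examples has exactly one outgoing transition, the execution order can be recovered by walking the graph from the entry marker. A hypothetical sketch, assuming one outgoing edge per state (it would not handle branching graphs):

```python
def execution_order(transitions):
    """Walk a linear state graph from entry [*] to exit [*].

    `transitions` is a list of (source, target) pairs; assumes each
    state appears as a source at most once (no branching).
    """
    nxt = dict(transitions)          # source -> target lookup
    order, state = [], nxt["[*]"]    # start at the entry marker
    while state != "[*]":            # stop at the exit marker
        order.append(state)
        state = nxt[state]
    return order

transitions = [
    ("[*]", "AnalyzeVideo"),
    ("AnalyzeVideo", "ExtractInteractions"),
    ("ExtractInteractions", "Output"),
    ("Output", "[*]"),
]
print(execution_order(transitions))
# ['AnalyzeVideo', 'ExtractInteractions', 'Output']
```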

Plan Field

The optional plan field provides a human-readable explanation of the state machine. It helps document the skill’s workflow for other developers:
```yaml
plan: |
  ## Objective
  Analyze assembly videos to detect and label finger-kitting interactions.

  ## Nodes
  - `AnalyzeVideo`: Watch the full video to understand the overall workflow
  - `ExtractInteractions`: Identify each kitting interaction with timestamps
  - `Output`: Compile results into the target schema format
```

Write descriptive state names and descriptions. The agent uses them to understand what to do at each step; more detail leads to better execution.