Retrieve generated images, videos, audio, and documents from agent responses
Artifacts are binary objects generated during agent interactions, such as images, videos, audio files, and documents. When agents perform operations like image generation, face blurring, video trimming, or document processing, the results are stored as artifacts that can be retrieved using object references.
Agent responses return object references (refs) instead of raw binary data. Each reference is a string identifier that follows a specific format: a 3-5 letter type prefix followed by an underscore and a 6-digit hexadecimal string (e.g., img_a1b2c3).
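The documented format can be checked with a short validator. This is a minimal sketch, not part of the SDK; the `is_valid_ref` helper and the assumption that prefixes are lowercase letters are illustrative only.

```python
import re

# Hypothetical helper (not in the SDK): validate a reference string against
# the documented format: a 3-5 letter type prefix, an underscore, and a
# 6-digit hexadecimal suffix.
REF_PATTERN = re.compile(r"^[a-z]{3,5}_[0-9a-f]{6}$")

def is_valid_ref(ref: str) -> bool:
    return REF_PATTERN.fullmatch(ref) is not None

print(is_valid_ref("img_a1b2c3"))    # True  (image ref)
print(is_valid_ref("recon_00ff42"))  # True  (reconstruction ref)
print(is_valid_ref("img-a1b2c3"))    # False (wrong separator)
```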
| Artifact Type | Prefix | Reference Type | Python Return Type |
| --- | --- | --- | --- |
| Image | `img_` | `ImageRef` | `PIL.Image.Image` |
| Video | `vid_` | `VideoRef` | `Path` (mp4) |
| Audio | `aud_` | `AudioRef` | `Path` (mp3) |
| Document | `doc_` | `DocumentRef` | `Path` (pdf) |
| Reconstruction | `recon_` | `ReconRef` | `Path` (spz) |
| URL | `url_` | `UrlRef` | `Path` (any of the above) |
| Array | `arr_` | `ArrayRef` | `np.ndarray` |
Import reference types from the SDK:
```python
from vlmrun.types import ImageRef, VideoRef, AudioRef, DocumentRef, ReconRef, UrlRef
```
Chat completion artifacts are scoped to a session. To retrieve an artifact for a specific chat completion, you need the session_id (from the chat completion response) and the object_id (from the Ref fields in the structured JSON response).
```python
from pydantic import BaseModel, Field
from PIL import Image
from vlmrun.client import VLMRun
from vlmrun.types import ImageRef

client = VLMRun(base_url="https://agent.vlm.run/v1", api_key="<VLMRUN_API_KEY>")

# Define a response model with an ImageRef field
class BlurredImageResponse(BaseModel):
    image: ImageRef = Field(..., description="The blurred image")

# Make a chat completion request
response = client.agent.completions.create(
    model="vlmrun-orion-1:auto",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Blur all the faces in this image"},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}}
            ]
        }
    ],
    response_format={
        "type": "json_schema",
        "schema": BlurredImageResponse.model_json_schema()
    }
)

# Parse the response
result = BlurredImageResponse.model_validate_json(response.choices[0].message.content)

# Retrieve the artifact using session_id and object_id
blurred_image: Image.Image = client.artifacts.get(
    session_id=response.session_id,
    object_id=result.image.id
)

# Display or save the image
blurred_image.save("blurred_output.jpg")
```
Multi-modal artifacts such as videos are downloaded and cached locally as files.
For now, a video artifact can be either a VideoRef or a UrlRef (specified via VideoRef | UrlRef). We are working on supporting more artifact types in the future.
Python SDK
```python
from pathlib import Path
from pydantic import BaseModel, Field
from vlmrun.client import VLMRun
from vlmrun.types import VideoRef, UrlRef

client = VLMRun(base_url="https://agent.vlm.run/v1", api_key="<VLMRUN_API_KEY>")

class VideoTrimResponse(BaseModel):
    start_time: str = Field(..., description="Start time of the trimmed segment")
    end_time: str = Field(..., description="End time of the trimmed segment")
    video: VideoRef | UrlRef = Field(..., description="The trimmed video")

response = client.agent.completions.create(
    model="vlmrun-orion-1:auto",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Trim this video to the first 10 seconds"},
                {"type": "video_url", "video_url": {"url": "https://example.com/video.mp4"}}
            ]
        }
    ],
    response_format={
        "type": "json_schema",
        "schema": VideoTrimResponse.model_json_schema()
    }
)

result = VideoTrimResponse.model_validate_json(response.choices[0].message.content)

# Retrieve the video artifact - returns a Path to the local file
video_path: Path = client.artifacts.get(
    session_id=response.session_id,
    object_id=result.video.id
)
print(f"Video saved to: {video_path}")
print(f"File size: {video_path.stat().st_size / 1024 / 1024:.2f} MB")
```
To retrieve an artifact for a specific agent execution, you need the execution_id (from the agent execution response) and the object_id (from the Ref fields in the structured JSON response).
Python SDK
```python
from pydantic import BaseModel, Field
from PIL import Image
from vlmrun.client import VLMRun
from vlmrun.client.types import AgentExecutionResponse, AgentExecutionConfig, ImageUrl
from vlmrun.types import ImageRef, MessageContent

client = VLMRun(base_url="https://agent.vlm.run/v1", api_key="<VLMRUN_API_KEY>")

# Define typed inputs using MessageContent
class ExecutionInputs(BaseModel):
    image: MessageContent = Field(..., description="The input image")

class ImageResponse(BaseModel):
    image: ImageRef = Field(..., description="The processed image")

# Execute an agent with typed inputs
execution: AgentExecutionResponse = client.agent.execute(
    name="image/blur-image",
    inputs=ExecutionInputs(
        image=MessageContent(type="image_url", image_url=ImageUrl(url="https://example.com/photo.jpg"))
    ),
    config=AgentExecutionConfig(
        prompt="Blur the entire image",
        response_model=ImageResponse
    )
)

# Wait for completion
execution = client.executions.wait(execution.id, timeout=180)

# Parse the response and retrieve the artifact
result = ImageResponse.model_validate(execution.response)
image: Image.Image = client.artifacts.get(
    execution_id=execution.id,
    object_id=result.image.id
)
```
Agent execution artifact retrieval via execution_id is being rolled out. Check the SDK release notes for availability.
The Python and Node SDKs automatically cache downloaded artifacts to avoid re-downloading the same files. Artifacts are stored in ~/.vlmrun/artifacts/{session_id}/ with filenames based on the object ID and the appropriate file extension.
```python
# First call downloads the artifact
video_path = client.artifacts.get(session_id=session_id, object_id="vid_a1b2c3")

# Subsequent calls return the cached file path
video_path = client.artifacts.get(session_id=session_id, object_id="vid_a1b2c3")  # No download
```
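Given the documented cache layout, the location of a cached artifact can be sketched as below. The `expected_cache_path` helper, the example session ID, and the mp4 extension (inferred from the `vid_` prefix) are illustrative assumptions, not SDK behavior guarantees.

```python
from pathlib import Path

# Sketch of where a cached artifact would land, based on the documented
# layout ~/.vlmrun/artifacts/{session_id}/. The extension is an assumption
# inferred from the object ID's type prefix (vid_ -> mp4).
def expected_cache_path(session_id: str, object_id: str, ext: str) -> Path:
    return Path.home() / ".vlmrun" / "artifacts" / session_id / f"{object_id}.{ext}"

p = expected_cache_path("sess_123456", "vid_a1b2c3", "mp4")
print(p.name)  # vid_a1b2c3.mp4
```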
When working with artifacts, keep these guidelines in mind:
For large artifacts like videos, the Python and Node SDKs download files to disk rather than loading them into memory. This prevents memory issues when working with large files. Always check the file size before loading video content into memory.
Use structured response models with appropriate Ref types (ImageRef, VideoRef, etc.) to ensure type safety and enable IDE autocompletion. The Python and Node SDKs will automatically handle the conversion to the appropriate Python type when retrieving artifacts.