Generate comprehensive, contextual captions for videos using state-of-the-art vision-language models. Perfect for accessibility, content management, and automated video analysis workflows.

Example video to be captioned.

Video caption
"A presenter stands in front of a whiteboard, gesturing toward charts and graphs while explaining quarterly results to a seated audience."

Usage Example

import openai

# Initialize the OpenAI client
client = openai.OpenAI(
    base_url="https://agent.vlm.run/v1/openai",
    api_key="<VLMRUN_API_KEY>",
)

# Caption the video
response = client.chat.completions.create(
    model="vlm-agent-1",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Generate a detailed caption for this video"},
                {"type": "image_url", "image_url": {"url": "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/video.caption/presentation.mp4", "detail": "auto"}},
            ],
        }
    ],
)

# Print the response
print(response.choices[0].message.content)
>> "A presenter stands in front of a whiteboard, gesturing toward charts and graphs while explaining quarterly results to a seated audience."

FAQ

You can simply ask for a more detailed caption by providing a more detailed prompt. In most cases, you can specify the approximate number of words you want, and the model will generate a caption of roughly that length.
  • Content Types: presentation, tutorial, interview, documentary, news
  • Scenes: office, outdoor, studio, classroom, conference room
  • People: presenter, audience, speaker, interviewer
  • Objects: whiteboard, charts, graphs, computer, microphone
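As a sketch, a length-controlled request could be built like this. The 100-word target and the exact prompt wording are illustrative assumptions, not documented parameters of the captioning API:

```python
# Build chat-completion kwargs asking for a longer, length-controlled caption.
# The word-count target and prompt phrasing are illustrative, not a fixed API
# parameter; the model treats them as a hint, not a hard limit.
VIDEO_URL = (
    "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/"
    "video.caption/presentation.mp4"
)

def build_caption_request(word_count: int) -> dict:
    """Return kwargs for client.chat.completions.create()."""
    prompt = (
        f"Generate a detailed caption of about {word_count} words for this "
        "video, covering the setting, the people, and the key objects shown."
    )
    return {
        "model": "vlm-agent-1",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": VIDEO_URL, "detail": "auto"}},
                ],
            }
        ],
    }

request = build_caption_request(100)
# Then call: client.chat.completions.create(**request)
```

The payload shape mirrors the usage example above; only the text prompt changes.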
Yes, the structured output includes segments with timestamps that break the video down into parts, each with its own description. The segments come as a list of dictionaries with start time, end time, and description fields.
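Assuming segments arrive in that shape, a small helper can render them as a readable timeline. The field names (`start_time`, `end_time`, `description`) and the sample data below are illustrative assumptions, not real model output:

```python
# Format video segments (start/end times in seconds plus a description)
# into a human-readable timeline. The field names and sample segments are
# assumptions for illustration; check the actual response schema.
def format_segments(segments: list[dict]) -> str:
    lines = []
    for seg in segments:
        start, end = seg["start_time"], seg["end_time"]
        lines.append(f"[{start:6.1f}s - {end:6.1f}s] {seg['description']}")
    return "\n".join(lines)

sample = [
    {"start_time": 0.0, "end_time": 12.5, "description": "Presenter introduces the agenda."},
    {"start_time": 12.5, "end_time": 45.0, "description": "Walkthrough of quarterly charts."},
]
print(format_segments(sample))
```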