Generate comprehensive, contextual captions for videos using state-of-the-art vision-language models. Perfect for accessibility, content management, and automated video analysis workflows.

Example video to be captioned.

Example Response

This is a sample response from the Chat Completions API usage example, generated for the video shown above:
Topic: The story of a multi-generational family bakery, its history, its destruction by fire, and the determination to rebuild and adapt the business for a new era.

Summary: The video chronicles the history of the Jenny Lee Bakery, a beloved institution in McKees Rocks, Pennsylvania, run by the Baker family for generations. It details the bakery's founding in 1941, its role in the community, and the passion for baking passed down through generations. The story takes a tragic turn with a devastating fire and a recession, leading to the closure and demolition of the bakery. However, the narrative concludes with the current generation, Scott Baker, deciding to rebuild the business with a modern, wholesale-focused approach.

Chapters (mm:ss format):

00:00 - 00:15: Scott Baker introduces himself and his family's deep-rooted connection to the McKees Rocks community through the Jenny Lee Bakery, which his grandfather opened in 1941.
00:15 - 00:31: A long-time employee and customer, Donna, shares fond memories of visiting the bakery for treats after church and later working there herself.
00:31 - 00:48: The video shows the transition to the next generation, with Scott's father, Bernie, taking over. Scott recalls his own childhood experiences working in the bakery and developing a love for the family business.
00:48 - 01:14: The narrative shifts to a tragic event, as Donna recounts learning that the bakery was on fire on Thanksgiving, a moment that cost her her job. Newspaper headlines confirm the devastating blaze.
01:14 - 01:42: Scott and his father, Bernie, recall the despair of seeing their life's work destroyed by the fire. The combination of the fire and the subsequent recession led to the difficult decision to close the bakery, which was later demolished.
01:42 - 02:08: Feeling burnt out, Scott was advised by his father to pursue a different career. However, Scott felt that baking was in his blood and was determined to revive the family business in McKees Rocks.
02:08 - 02:23: After researching the modern market and realizing the decline of traditional retail bakeries, Scott devises a new plan. He decides to adapt by creating a wholesale bakery to supply baked goods to other stores.

Usage Example

import openai

# Initialize the OpenAI client
client = openai.OpenAI(
    base_url="https://agent.vlm.run/v1/openai",
    api_key="<VLMRUN_API_KEY>",
)

# Caption the video
response = client.chat.completions.create(
    model="vlm-agent-1",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Parse this video"},
                {"type": "video_url", "video_url": {"url": "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/video.transcription/bakery.mp4"}},
            ],
        }
    ],
)

# Print the response
print(response.choices[0].message.content)

FAQ

To get a more detailed caption, simply provide a more detailed prompt. In most cases, you can state the number of words you want the caption to be, and the model will generate a caption of roughly that length.
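As a rough sketch of this advice, the prompt can be assembled by a small helper that optionally appends a target word count. The helper name `build_caption_messages`, its `target_words` parameter, and the prompt wording are hypothetical and not part of the API; the message shape simply mirrors the Usage Example.

```python
from typing import Optional


def build_caption_messages(video_url: str, target_words: Optional[int] = None) -> list:
    """Build a Chat Completions `messages` list asking for a video caption.

    Hypothetical helper: the prompt wording is an illustration only.
    """
    prompt = "Write a detailed caption for this video."
    if target_words is not None:
        # Ask for an approximate length; the model treats this as guidance.
        prompt += f" Aim for roughly {target_words} words."
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "video_url", "video_url": {"url": video_url}},
            ],
        }
    ]
```

The returned list can be passed as the `messages` argument of `client.chat.completions.create(...)` in place of the hard-coded prompt.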
  • Content Types: presentation, tutorial, interview, documentary, news
  • Scenes: office, outdoor, studio, classroom, conference room
  • People: presenter, audience, speaker, interviewer
  • Objects: whiteboard, charts, graphs, computer, microphone
The video segments come in the format of a list of dictionaries with start time, end time, and description fields.
Yes, the structured output includes segments with timestamps that break down the video into different parts with descriptions for each segment.
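For illustration, a list of segments like the one described above could be rendered as chapter lines in the same mm:ss style as the example response. This is a minimal sketch under assumptions: the key names `start_time`, `end_time`, and `description`, and the use of seconds for the time values, are guesses based on the field descriptions above, so check them against the actual response before relying on them.

```python
def to_mmss(seconds: float) -> str:
    """Convert a duration in seconds to an mm:ss string."""
    total = int(seconds)
    return f"{total // 60:02d}:{total % 60:02d}"


def format_segments(segments: list) -> list:
    """Render each segment dictionary as 'mm:ss - mm:ss: description'.

    Assumes (hypothetically) keys named start_time, end_time, description,
    with times expressed in seconds.
    """
    return [
        f"{to_mmss(s['start_time'])} - {to_mmss(s['end_time'])}: {s['description']}"
        for s in segments
    ]


# Example with made-up data shaped like the assumed format.
example = [{"start_time": 0, "end_time": 15, "description": "Introduction"}]
print("\n".join(format_segments(example)))
```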