Generate comprehensive, contextual captions for videos using state-of-the-art vision-language models. Perfect for accessibility, content management, and automated video analysis workflows.

Example video to be captioned.

Video caption
"A presenter stands in front of a whiteboard, gesturing toward charts and graphs while explaining quarterly results to a seated audience."

Usage Example

import openai

# Initialize the OpenAI client
client = openai.OpenAI(
    base_url="https://agent.vlm.run/v1/openai",
    api_key="<VLMRUN_API_KEY>",
)

# Caption the video
response = client.chat.completions.create(
    model="vlm-agent-1",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Generate a detailed caption for this video"},
                {"type": "image_url", "image_url": {"url": "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/video.caption/presentation.mp4", "detail": "auto"}},
            ],
        }
    ],
)

# Print the response
print(response.choices[0].message.content)
>> "A presenter stands in front of a whiteboard, gesturing toward charts and graphs while explaining quarterly results to a seated audience."

FAQ

You can simply ask for a more detailed caption by providing a more detailed prompt. In most cases, you can specify the approximate number of words you want, and the model will generate a caption of roughly that length.
  • Content Types: presentation, tutorial, interview, documentary, news
  • Scenes: office, outdoor, studio, classroom, conference room
  • People: presenter, audience, speaker, interviewer
  • Objects: whiteboard, charts, graphs, computer, microphone
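As a sketch, a length-controlled request could be built like this. The 100-word target and the exact prompt wording are illustrative assumptions, not documented parameters of the captioning API:

```python
# Build chat-completion kwargs asking for a longer, length-controlled caption.
# The word-count target and prompt phrasing are illustrative, not a fixed API
# parameter; the model treats them as a hint, not a hard limit.
VIDEO_URL = (
    "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/"
    "video.caption/presentation.mp4"
)

def build_caption_request(word_count: int) -> dict:
    """Return kwargs for client.chat.completions.create()."""
    prompt = (
        f"Generate a detailed caption of about {word_count} words for this "
        "video, covering the setting, the people, and the key objects shown."
    )
    return {
        "model": "vlm-agent-1",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": VIDEO_URL, "detail": "auto"}},
                ],
            }
        ],
    }

request = build_caption_request(100)
# Then call: client.chat.completions.create(**request)
```

The payload shape mirrors the usage example above; only the text prompt changes.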
Yes, the structured output includes segments with timestamps that break the video down into parts, each with its own description. The segments come as a list of dictionaries with start time, end time, and description fields.
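Assuming segments arrive in that shape, a small helper can render them as a readable timeline. The field names (`start_time`, `end_time`, `description`) and the sample data below are illustrative assumptions, not real model output:

```python
# Format video segments (start/end times in seconds plus a description)
# into a human-readable timeline. The field names and sample segments are
# assumptions for illustration; check the actual response schema.
def format_segments(segments: list[dict]) -> str:
    lines = []
    for seg in segments:
        start, end = seg["start_time"], seg["end_time"]
        lines.append(f"[{start:6.1f}s - {end:6.1f}s] {seg['description']}")
    return "\n".join(lines)

sample = [
    {"start_time": 0.0, "end_time": 12.5, "description": "Presenter introduces the agenda."},
    {"start_time": 12.5, "end_time": 45.0, "description": "Walkthrough of quarterly charts."},
]
print(format_segments(sample))
```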