VLM-1 can be used to extract rich insights from video podcasts and interviews. The API can be used to transcribe

Notebook Example

Colab

To go straight to the code, open the Colab notebook linked above. It shows how to use the VLM-1 API to extract structured data from video podcasts and interviews.

Key Features

In the sections below, we’ll showcase a few notable features of the API for analyzing podcasts or video interviews. You can also refer to the features extracted in the Analyzing Audio Podcasts guide for a more detailed overview of the audio transcription capabilities of the API.

1. Automatic Chaptering and Summarization

VLM-1 can automatically generate chapter summaries for video podcasts or interviews. This is useful for building a table of contents for the video or for summarizing the key points discussed. As you can see in the sample output below, the API extracts a general visual description of the segment with timestamps, the highlighted chapter text (“AI Will Create More Successful Founders”), and the people and objects in the scene that may be relevant for analysis.
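As a rough illustration of the table-of-contents use case, the sketch below turns chaptered segments into timestamped lines. It does not call the API. The segment shape (`start_time` in seconds, `chapter_title`, `description`) and the sample values are assumptions made for this example, not the documented VLM-1 response schema, so adapt the field names to the output you actually receive.

```python
from typing import Dict, List


def format_timestamp(seconds: float) -> str:
    """Render a second offset as MM:SS, or H:MM:SS at one hour or more."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes:02d}:{secs:02d}"


def build_toc(segments: List[Dict]) -> List[str]:
    """Return one line per chapter, skipping consecutive segments
    that repeat the previous chapter title."""
    toc = []
    previous_title = None
    for segment in sorted(segments, key=lambda s: s["start_time"]):
        title = segment.get("chapter_title")
        if title and title != previous_title:
            toc.append(f"{format_timestamp(segment['start_time'])}  {title}")
            previous_title = title
    return toc


# Hypothetical sample data in the assumed shape, not real API output.
sample_segments = [
    {"start_time": 0.0, "chapter_title": "Intro",
     "description": "Two people seated in a studio."},
    {"start_time": 95.0, "chapter_title": "AI Will Create More Successful Founders",
     "description": "Speaker on the left, chapter list on the right."},
    {"start_time": 130.0, "chapter_title": "AI Will Create More Successful Founders",
     "description": "Same chapter, closer shot."},
]

for line in build_toc(sample_segments):
    print(line)
```

Collapsing repeated titles keeps the table of contents to one entry per chapter even when the video is analyzed as many short segments.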

2. Extracting Text from Slides or Visual Aids

VLM-1 can also extract text from slides or visual aids shown during the video. This is useful for capturing key points, quotes, or other information that is presented visually. As you can see in the sample output below, the API extracts the highlighted text from the right portion of the screen (“Get In Early”) alongside all the other chapter texts displayed.
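Because the same slide text usually stays on screen across many frames, a common post-processing step is to record when each piece of text first appears. The sketch below shows one way to do that. The frame shape (`timestamp` in seconds and a `texts` list) and the sample values are assumptions for illustration only, not the documented VLM-1 response schema.

```python
from typing import Dict, List


def first_appearances(frames: List[Dict]) -> Dict[str, float]:
    """Map each distinct on-screen text to the timestamp where it first appears."""
    seen: Dict[str, float] = {}
    for frame in sorted(frames, key=lambda f: f["timestamp"]):
        for text in frame.get("texts", []):
            # Collapse runs of whitespace so trivially different strings match.
            key = " ".join(text.split())
            if key and key not in seen:
                seen[key] = frame["timestamp"]
    return seen


# Hypothetical sample data in the assumed shape, not real API output.
sample_frames = [
    {"timestamp": 12.0, "texts": ["AI Will Create More Successful Founders"]},
    {"timestamp": 48.0, "texts": ["AI Will Create More Successful Founders",
                                  "Get In  Early"]},
    {"timestamp": 60.0, "texts": ["Get In Early"]},
]

for text, timestamp in first_appearances(sample_frames).items():
    print(f"{timestamp:>6.1f}s  {text}")
```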

3. Extracting Scene Coordinates for Objects / Entities

This feature is currently in development and will be available soon.

VLM-1 can also extract the coordinates of objects or entities in the video scene. This can be useful for tracking the movement of objects or persons in the video, or for analyzing the spatial relationships between different entities. As you can see in the sample output below, the API is able to extract the coordinates of the highlighted text (“AI Will Create More Successful Founders”) in the video scene.
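Since this feature is still in development, its output format is not yet known. The sketch below only illustrates the kind of downstream handling that coordinates make possible. It assumes, hypothetically, that each entity comes with a bounding box normalized to the range 0 to 1 as `(x_min, y_min, x_max, y_max)`. The actual VLM-1 format may differ.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def to_pixel_box(box: Box, width: int, height: int) -> Tuple[int, int, int, int]:
    """Scale a normalized box (assumed 0..1) to pixel coordinates for a frame."""
    x_min, y_min, x_max, y_max = box
    return (
        round(x_min * width),
        round(y_min * height),
        round(x_max * width),
        round(y_max * height),
    )


def box_center(box: Tuple[float, float, float, float]) -> Tuple[float, float]:
    """Return the center point of a box, useful for tracking movement over time."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2, (y_min + y_max) / 2)


# Hypothetical normalized box for a piece of highlighted text on a 1920x1080 frame.
pixel_box = to_pixel_box((0.25, 0.5, 0.75, 1.0), width=1920, height=1080)
print(pixel_box, box_center(pixel_box))
```

Comparing box centers across consecutive timestamps gives a simple way to follow an entity's movement through the scene.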