Temporal grounding is a powerful capability of VLM Run that links extracted data to precise time segments within audio and video content. This feature is especially valuable for applications that need to process and analyze time-based media—such as podcasts, interviews, lectures, and meetings—by providing structured, timestamped insights.

Temporal grounding can be broadly categorized into two key functions:

  1. Time Segmentation: Dividing content into meaningful segments, each with precise start and end timestamps.
  2. Content Localization: Pinpointing exactly when and where specific information appears within the timeline.

Using Temporal Grounding

Temporal grounding is enabled by default for all audio and video domains, so no extra configuration is needed when processing audio or video content.

from pathlib import Path
from vlmrun.client import VLMRun

client = VLMRun(api_key="...")

# Transcribe the video; temporal grounding is included by default.
response = client.video.generate(
    file=Path("path/to/episode.mp4"),
    domain="video.transcription",
)
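
Audio-only files follow the same pattern. As a sketch, assuming the analogous audio interface on the client and the audio.transcription domain (check the SDK reference for the exact method names):

from pathlib import Path
from vlmrun.client import VLMRun

client = VLMRun(api_key="...")

# Assumed audio counterpart to the video call above; temporal grounding
# is likewise on by default for audio domains.
response = client.audio.generate(
    file=Path("path/to/episode.mp3"),
    domain="audio.transcription",
)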

Understanding the Output

The response includes temporal information for each extracted segment: precise start and end times in seconds, the transcribed audio content, and a description of the corresponding video frames:

{
  "metadata": {
    "duration": 248.67   // total duration of the video in seconds
  },
  "segments": [
    {
      "start_time": 0,   // start time of the first segment in seconds
      "end_time": 21.33, // end time of the first segment in seconds
      "audio": {
        "content": " The Keys Rocks is a rough and tumble area just outside Pittsburgh. My name is Scott Baker. My family has been connected with this community for generations. In 1941, my grandfather opened his bakery here, and he called it Jenny Lee. We used to always go to Jenny Lee's after church. If you were good in church,"
      },
      "video": {
        "content": "In this image we can see many buildings and trees. There is a bridge in the image. There are many towers in the background of the image and the sky is in white color."
      }
    },
    {
      "start_time": 21.33, // start time of the second segment in seconds
      "end_time": 42.5,    // end time of the second segment in seconds
      "audio": {
        "content": " oh, egg custard pot. Gooden Church. Oh, I ate custard pies, homemade bread that was still warm. Years later, I worked in the store. I did wedding cakes. My father, Bernie, took over after my grandfather retired. I am a baker. took over after my grandfather retired. I am a baker by name, baker by trade. I started coming in to the bakery with my dad when I was seven or eight years old."
      },
      "video": {
        "content": "In this image we can see a man standing on the floor. We can also see a group of people standing beside a table containing some food items in a cupboard. On the backside we can a wall with some photo frames and a roof with some ceiling lights."
      }
    },
    ...
    {
      "start_time": 230.67, // start time of the last segment in seconds
      "end_time": 248.67,   // end time of the last segment in seconds
      "audio": {
        "content": " I made that. I'm proud to say that. You know, You know, You know, Thank you."
      },
      "video": {
        "content": "In this image we can see a person holding a food item in his hand. In front of him there is a table. In the background of the image there are some objects."
      }
    }
  ]
}
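
Once you have the prediction, its segments are easy to consume in code. The sketch below inlines a trimmed stand-in for the structured output above (so the snippet runs as-is) and prints a human-readable timeline:

import json

# Stand-in for the structured output shown above, trimmed to one segment;
# in practice, parse the segments out of the prediction's response.
result = json.loads("""
{
  "segments": [
    {"start_time": 0, "end_time": 21.33,
     "audio": {"content": "The Keys Rocks is a rough and tumble area just outside Pittsburgh."},
     "video": {"content": "In this image we can see many buildings and trees."}}
  ]
}
""")

def fmt(seconds: float) -> str:
    """Format seconds as M:SS for display."""
    m, s = divmod(int(seconds), 60)
    return f"{m}:{s:02d}"

for segment in result["segments"]:
    start, end = segment["start_time"], segment["end_time"]
    transcript = segment["audio"]["content"].strip()
    print(f"[{fmt(start)} - {fmt(end)}] {transcript}")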

To test the grounding precision of our models, open the prediction on the VLM Run Platform and click the start_time or end_time of any segment to jump to the corresponding point in the audio/video.

Use Cases

Temporal grounding enables numerous applications:

  1. Searchable Media Archives: Create searchable indexes of audio and video content
  2. Meeting Summaries: Generate timestamped summaries of meetings with speaker attribution
  3. Content Navigation: Build interfaces that allow users to jump to specific topics or speakers
  4. Podcast Production: Automatically generate show notes with timestamps and speaker labels
  5. Video Chapters: Create chapter markers for long-form video content
  6. Interview Analysis: Extract insights from interviews with accurate speaker attribution
  7. Compliance Monitoring: Track who said what and when in regulated communications

By leveraging VLM Run’s temporal grounding capabilities, you can extract rich, time-based structured data from audio and video content, enabling powerful applications that understand not just what was said, but who said it and when.
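
For instance, the Searchable Media Archives use case falls directly out of the segment structure: index each segment's transcript and video description, then return start times for matching queries. A minimal sketch, reusing the result dict from the snippet above with plain substring matching (a production archive would use a real search index):

def search_segments(result: dict, query: str) -> list[tuple[float, str]]:
    """Return (start_time, transcript) pairs for segments whose audio
    or video description mentions the query (case-insensitive)."""
    query = query.lower()
    hits = []
    for segment in result["segments"]:
        text = segment["audio"]["content"] + " " + segment["video"]["content"]
        if query in text.lower():
            hits.append((segment["start_time"], segment["audio"]["content"].strip()))
    return hits

# Jump straight to every mention of Pittsburgh:
for start, transcript in search_segments(result, "pittsburgh"):
    print(f"{start:.2f}s: {transcript}")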

Try our Video / Audio -> JSON API today

Head over to our Video -> JSON or Audio -> JSON guides to start building your own video/audio processing pipelines with VLM Run. Sign up for access on our platform.