While traditional speech-to-text models focus solely on transcription, vlm-1 can simultaneously generate rich, structured insights from audio content. These include transcription, chapter segmentation, topic extraction, entity recognition, and sentiment analysis, all in a single API call. These capabilities are particularly valuable for podcast analysis, interview processing, and content management systems. This guide walks you through using the audio.transcription domain to transcribe and analyze long-form audio content.

In subsequent guides, we’ll cover more advanced capabilities like topic extraction, entity recognition, and sentiment analysis.


Analyzing Podcast Episodes

Let’s look at a podcast analysis example to see how vlm-1 extracts structured insights from audio content. In this example, we’ll use vlm-1 to transcribe and analyze a podcast episode, splitting it into time-stamped segments. Each segment has a start and end timestamp and its portion of the transcript, and together they form the full transcript, which can be used for content organization and discovery.

from pathlib import Path

from vlmrun.client import VLMRun
from vlmrun.client.types import PredictionResponse

# Initialize the client
client = VLMRun(api_key="<VLMRUN_API_KEY>")

# Submit the audio file for transcription as a batch (asynchronous) job
prediction: PredictionResponse = client.audio.generate(
    file=Path("path/to/podcast_episode.mp3"),
    domain="audio.transcription",
    batch=True,
)
# In batch mode the result is not ready yet, so print the job ID and status
print(prediction.id, prediction.status)

# Wait for the prediction to complete (with a timeout of 600 seconds)
prediction = client.predictions.wait(id=prediction.id, timeout=600)
print(prediction.response.model_dump())

Understanding the Output

Here’s an example of the output from the audio.transcription domain:

Example Audio Transcription
{
  "metadata": {
    "duration": 146.94
  },
  "segments": [
    {
      "start_time": 0,
      "end_time": 24.88,
      "content": " After reading tons of productivity books, I came across so many rules, like the two-year rule, the five-minute rule, the five-second rule. No, not that five second rule. The problem is that these rules were meant for companies or entrepreneurs, but I was able to adapt them to my studies during med school and drastically cut down to my procrastination. So I'm going to share with you two different two minute rules for the next two minutes. The first two minute rule comes from"
    },
    {
      "start_time": 24.88,
      "end_time": 45.5,
      "content": " getting things done by David Allen. He says if it takes two minutes to do, get it done right now. For example, if I need to take out the trash today, it takes two minutes to do. So if I'm thinking about it now, might as well just do it now. Instead of writing it down on a to-do list or probably forgetting about it or having to come back to it later, which takes more than two minutes. That's how I see it."
    },
    {
      "start_time": 45.5,
      "end_time": 67.86,
      "content": " So here's a list of things that might take two minutes throughout the day, like organizing your desk or watering your plants or clipping those nasty nails. I just do it when I notice it, but these little things start to add up, so this rule biases my brain towards taking action and away from procrastination. The second two-minute rule comes from atomic habits by James Clear. He says, when you're trying to do something you don't really want to do, simplify the"
    },
    {
      "start_time": 67.86,
      "end_time": 91.27,
      "content": " task down to two minutes or less. So doing your entire reading assignment becomes just reading one paragraph or memorizing the entire periodic table becomes memorizing just 10 flashcards. Now, some of you might think, yeah, this is just a Jedi mind trick. Like, why would I fall for it? How is this at all sustainable? And to that, he says, when you're starting out, limit yourself to only two minutes."
    },
    {
      "start_time": 91.27,
      "end_time": 117.33,
      "content": " So back in med school, I wanted to build a habit of studying for one hour every day before dinner. So I tried this trick, but I limited myself to just two minutes. I'd sit down, open my laptop, study for two minutes, and then close my laptop and went to do something else. It seems unproductive at first, right? It seems stupid. But staying consistent with this two-minute routine day after day meant that I was becoming the type of person who studies daily."
    },
    {
      "start_time": 117.33,
      "end_time": 137.99,
      "content": " I was mastering the habit of just showing up because a habit needs to be established before it can be expanded upon. If I can't become a person who studies for just two minutes a day, I'd never be able to become the person that studies for an hour a day. You've got to start somewhere, but starting small is easier. There's a lot of other useful tips from books."
    },
    {
      "start_time": 138.15,
      "end_time": 146.94,
      "content": " I cover more here in this video on three books and three minutes. Check it out. And if you guys like these types of videos, let me know in the comments below. I'll see you there. Bye."
    }
  ]
}
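Before using the segments downstream, it can help to sanity-check their timestamps. The sketch below is our own illustration, not part of the VLM Run SDK: the `check_segments` helper is hypothetical, and it assumes the response has been converted to a plain dict (for example with `model_dump()`) with the shape shown above. It checks that segments are time-ordered, do not overlap, and end within `metadata.duration`, and it reports any gaps, such as the one between 137.99 s and 138.15 s in the example.

```python
from typing import Any


def check_segments(result: dict[str, Any], tol: float = 1e-6) -> list[tuple[float, float]]:
    """Validate segment timestamps and return gaps as (gap_start, gap_end) pairs.

    Assumes the audio.transcription output shape:
    {"metadata": {"duration": ...}, "segments": [{"start_time", "end_time", "content"}, ...]}
    """
    duration = float(result["metadata"]["duration"])
    gaps: list[tuple[float, float]] = []
    prev_end = 0.0
    for i, seg in enumerate(result["segments"]):
        start, end = float(seg["start_time"]), float(seg["end_time"])
        if end < start:
            raise ValueError(f"segment {i} ends before it starts: {start} > {end}")
        if start < prev_end - tol:
            raise ValueError(f"segment {i} overlaps the previous segment at {start}")
        if start > prev_end + tol:
            # Audio between two consecutive segments that no segment covers.
            gaps.append((prev_end, start))
        prev_end = end
    if prev_end > duration + tol:
        raise ValueError(f"last segment ends at {prev_end}, past duration {duration}")
    return gaps


# Condensed version of the output above (several segments merged, contents elided).
example = {
    "metadata": {"duration": 146.94},
    "segments": [
        {"start_time": 0, "end_time": 117.33, "content": "..."},
        {"start_time": 117.33, "end_time": 137.99, "content": "..."},
        {"start_time": 138.15, "end_time": 146.94, "content": "..."},
    ],
}
print(check_segments(example))  # [(137.99, 138.15)]
```

Raising on overlaps but only reporting gaps is a design choice: a short gap usually just means a pause in speech, while an overlap suggests the data is not what downstream code expects.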

Let’s break down the output into its key components:

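  • segments: The time-ordered list of transcribed segments, each with start and end timestamps. Consecutive segments usually share a boundary, but small gaps can occur (for example, between 137.99 s and 138.15 s above).
  • segment.start_time: The start time of the segment in seconds, relative to the start of the audio.
  • segment.end_time: The end time of the segment in seconds, relative to the start of the audio.
  • segment.content: The raw transcription of the speech in the segment. Note that each string begins with a leading space.
  • metadata.duration: The total duration of the audio in seconds. In the example, the last segment ends exactly at this value (146.94).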
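These fields are enough to build simple show notes: a list of chapter markers plus the full transcript. The sketch below is illustrative and not part of the VLM Run SDK. `format_timestamp` and `to_show_notes` are hypothetical helpers, and labeling each chapter with the first few words of its segment is our own naive heuristic, since the output contains no chapter titles. It assumes the response has already been converted to a dict with `model_dump()`.

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as MM:SS, or H:MM:SS for audio an hour or longer."""
    total = int(seconds)  # truncate fractional seconds
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes:02d}:{secs:02d}"


def to_show_notes(result: dict, preview_words: int = 8) -> tuple[str, str]:
    """Return (chapter_list, full_transcript) built from the segments."""
    chapters = []
    for seg in result["segments"]:
        # Heuristic chapter label: the first few words of the segment.
        preview = " ".join(seg["content"].split()[:preview_words])
        chapters.append(f"{format_timestamp(seg['start_time'])} {preview}")
    # Each segment's content starts with a space, so strip before joining.
    transcript = " ".join(seg["content"].strip() for seg in result["segments"])
    return "\n".join(chapters), transcript


# Shortened copy of the first two segments from the example output.
result = {
    "metadata": {"duration": 45.5},
    "segments": [
        {"start_time": 0, "end_time": 24.88,
         "content": " After reading tons of productivity books, I came across so many rules"},
        {"start_time": 24.88, "end_time": 45.5,
         "content": " getting things done by David Allen."},
    ],
}
chapters, transcript = to_show_notes(result)
print(chapters)
# 00:00 After reading tons of productivity books, I came
# 00:24 getting things done by David Allen.
```

Truncating rather than rounding the timestamps keeps each chapter marker at or before the point where the segment actually begins.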

Key Features

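  • Temporal Grounding: Precise time segmentation, so any piece of content can be located in the audio by its timestamps
  • Long-form Support: Process recordings of 12+ hours, with automatic segmentation
  • Batch Processing: Submit jobs asynchronously (batch=True) and collect the results later, which suits large audio collections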
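Temporal grounding means a search hit can be mapped back to a position in the audio. As a rough illustration, the hypothetical `find_mentions` helper below (not part of the VLM Run SDK) runs a case-insensitive whole-word match over each segment’s text and returns the time ranges of matching segments. It assumes the dict shape of the example output; a real system would more likely use a search index, but returning timestamps alongside matches works the same way.

```python
import re


def find_mentions(result: dict, term: str) -> list[tuple[float, float]]:
    """Return (start_time, end_time) for every segment whose content mentions `term`.

    Case-insensitive, whole-word match; intended for plain words and phrases.
    """
    pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
    return [
        (seg["start_time"], seg["end_time"])
        for seg in result["segments"]
        if pattern.search(seg["content"])
    ]


# Shortened excerpts of two segments from the example output.
result = {
    "segments": [
        {"start_time": 24.88, "end_time": 45.5,
         "content": " getting things done by David Allen."},
        {"start_time": 45.5, "end_time": 67.86,
         "content": " The second two-minute rule comes from atomic habits by James Clear."},
    ]
}
print(find_mentions(result, "james clear"))  # [(45.5, 67.86)]
```

Because the match is whole-word, searching for "habit" finds nothing here even though "habits" appears; drop the `\b` anchors if substring matches are preferred.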

Get Started with our Audio -> JSON API

Head over to our Audio -> JSON API page to start building your own audio processing pipeline with VLM Run, and sign up for access on our platform.