Foundation vision models like OpenAI’s GPT-4o and Anthropic’s Claude Vision support question answering over visual inputs, a.k.a. chat with images. In practice, however, we believe chat is not the ideal interface for many software workflows, especially those that require automation. Instead, developers want strongly-typed, validated outputs that can be easily integrated into their existing software workflows.

VLM-1 is built on exactly this insight: instead of free-form text outputs, we define our API in terms of fixed types for specific domains (e.g., PDF presentations, TV news broadcasts, audio/video podcasts, etc.). These schemas can be arbitrarily nested and can include lists, dictionaries, and other complex types that richly capture the information contained in the input.
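As a rough sketch of what such a schema might look like (the class and field names below are illustrative, not VLM-1's actual domain definitions), a presentation-slide type could be expressed in Pydantic:

```python
from pydantic import BaseModel, Field

class Chart(BaseModel):
    """A chart or figure detected on a slide (illustrative fields)."""
    title: str
    description: str

class Slide(BaseModel):
    """Structured representation of a single presentation slide."""
    title: str
    bullet_points: list[str] = Field(default_factory=list)
    charts: list[Chart] = Field(default_factory=list)  # nested list of complex types
    speaker_notes: str | None = None
```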

In other words, VLM-1 is purpose-built for what is popularly known as JSON mode. This mode is particularly useful for developers who want to build automation workflows, data pipelines, or other software systems that require structured data as output.
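Concretely, JSON mode means the response is a machine-parseable JSON document rather than prose. Reusing the illustrative Slide model sketched above, a hypothetical response could be parsed and validated directly:

```python
# A hypothetical JSON-mode response for a single slide.
response_text = """
{
  "title": "Q3 Revenue Overview",
  "bullet_points": ["Revenue up 12% QoQ", "EMEA led regional growth"],
  "charts": [{"title": "Revenue by Region", "description": "Bar chart of Q3 revenue"}],
  "speaker_notes": null
}
"""

slide = Slide.model_validate_json(response_text)  # raises if the payload drifts from the schema
print(slide.title)  # "Q3 Revenue Overview"
```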

Supported Domains

In the following sections, we provide example schemas for various domains, including presentation slides, TV news broadcasts, web automation, and more.

Extract Structured Data

With our pre-defined domains, you can quickly extract structured data from images, videos, and other visual content in a single API call. The extracted data is validated against the domain's schema, ensuring that it conforms to the expected structure and types.
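As a minimal sketch of the REST path (the endpoint URL, payload fields, and domain identifier below are assumptions for illustration; consult the API reference for the real ones):

```python
import os
import requests

resp = requests.post(
    "https://api.example.com/v1/generate",  # placeholder endpoint
    headers={"Authorization": f"Bearer {os.environ['VLM_API_KEY']}"},
    json={
        "model": "vlm-1",
        "domain": "document.presentation",  # assumed domain identifier
        "url": "https://example.com/deck.pdf",  # the visual input to extract from
    },
    timeout=60,
)
resp.raise_for_status()
data = resp.json()  # structured output conforming to the domain schema
```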

You can query the API via RESTful endpoints, or via the OpenAI Python SDK through our OpenAI-compatible API.
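For the OpenAI-compatible path, a sketch using the standard OpenAI Python SDK might look like the following; the base URL and model name are placeholders:

```python
import os
from openai import OpenAI

# Point the stock OpenAI client at the OpenAI-compatible endpoint (URL is a placeholder).
client = OpenAI(base_url="https://api.example.com/v1", api_key=os.environ["VLM_API_KEY"])

completion = client.chat.completions.create(
    model="vlm-1",  # placeholder model name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract the slide contents."},
                {"type": "image_url", "image_url": {"url": "https://example.com/slide.png"}},
            ],
        }
    ],
    response_format={"type": "json_object"},  # request JSON-mode output
)
print(completion.choices[0].message.content)  # a JSON string, ready to validate
```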

Custom Schemas

In addition to the pre-defined schemas above, VLM-1 also supports custom schemas (see next page). This feature lets you define your own schema for a specific domain or use case and have VLM-1 extract structured data that conforms to it.
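As an illustrative sketch (the request fields are assumptions, not the documented interface), a custom schema could be declared with Pydantic and shipped with the request as JSON Schema:

```python
import os
import requests
from pydantic import BaseModel

class Invoice(BaseModel):
    """A custom schema for a domain without a pre-defined type."""
    vendor: str
    invoice_number: str
    total_usd: float
    line_items: list[str]

resp = requests.post(
    "https://api.example.com/v1/generate",  # placeholder endpoint
    headers={"Authorization": f"Bearer {os.environ['VLM_API_KEY']}"},
    json={
        "model": "vlm-1",
        "json_schema": Invoice.model_json_schema(),  # Pydantic v2 JSON Schema export
        "url": "https://example.com/invoice.png",
    },
    timeout=60,
)
resp.raise_for_status()
invoice = Invoice.model_validate(resp.json())  # client-side validation of the response
```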