While most traditional computer-vision models are specialized for a single task such as image classification, captioning, or tagging, VLM-1 can simultaneously generate a wide range of structured outputs from images. This includes captions, tags, descriptions, and other structured data that can be used for cataloging, search, retrieval, and other applications.

Cataloging Product Images

Let’s look at a product cataloging example to see how VLM-1 can be used to generate structured data from images. In this example, we’ll use VLM-1 to generate captions, tags, and descriptions for a set of images of different products. This structured data can then be used to create a product catalog that can be searched, filtered, and analyzed in various ways.

For this example, we’re going to use a small fashion dataset, ashraq/fashion-product-images-small, from Hugging Face.

Preview of the 'fashion-product-images-small' dataset from Huggingface.

Defining a custom schema for cataloging

In the sections below, we’ll showcase a few notable features of the API for image cataloging. VLM-1 can automatically generate descriptions for products based on the images provided. This can be useful for creating detailed product listings, search results, or other content that requires structured descriptions of products. First let’s create a custom schema that will be used to generate the descriptions.

from typing import Literal
from pydantic import BaseModel, Field

class ProductCatalog(BaseModel):
    description: str = Field(..., description="A 2-sentence general visual description of the product embedded as an image.")
    category: str = Field(..., description="One- or two-word category of the product (e.g., Apparel, Accessories, Footwear, etc.).")
    season: Literal["Fall", "Spring", "Summer", "Winter"] = Field(..., description="The season the product is intended for.")
    gender: Literal["Men", "Women", "Kids"] = Field(..., description="Gender or audience the product is intended for.") 
    

Extracting cataloging information from images

Once you have defined your custom schema, you can use VLM-1 to extract product cataloging information directly from images that conform to this schema. The extracted data will be validated against the schema you defined, ensuring that it conforms to the expected structure and types.

We support querying the API via RESTful endpoints, or using the OpenAI Python SDK with our OpenAI-Compatible API.
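As a rough sketch, a REST request for the schema above might carry a JSON body like the one below. The field names (`image_url`, `json_schema`), the model identifier, and the schema encoding are assumptions for illustration only; consult the API reference for the actual request format.

```python
import json

# Hypothetical request body for a VLM-1 structured-prediction call.
# Field names and the model identifier are assumptions for illustration;
# the real API may differ.
request_body = {
    "model": "vlm-1",  # assumed model identifier
    "image_url": "https://example.com/products/navy-plaid-shirt.jpg",
    # Mirrors the ProductCatalog Pydantic model as a JSON Schema fragment.
    "json_schema": {
        "type": "object",
        "properties": {
            "description": {"type": "string"},
            "category": {"type": "string"},
            "season": {"enum": ["Fall", "Spring", "Summer", "Winter"]},
            "gender": {"enum": ["Men", "Women", "Kids"]},
        },
        "required": ["description", "category", "season", "gender"],
    },
}

payload = json.dumps(request_body)  # body you would POST to the endpoint
```

With the OpenAI Python SDK, the equivalent call would go through the chat-completions interface pointed at our OpenAI-compatible base URL, passing the same schema as the response format.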

Example Product Cataloging Prediction

Let’s take a look at the sample output from the API for the first image of a navy plaid shirt in the product catalog. The API is able to generate a detailed description of the product, including the category, season, and gender it is intended for. This structured data can be used to create a product listing or search results for the product.

{
  "description": "A casual, button-up plaid shirt with short sleeves in a light fabric. The shirt features a combination of blue and white colors in a checkered pattern.",
  "category": "Apparel",
  "season": "Summer",
  "gender": "Men"
}

Let’s break down the output by the task each field corresponds to:

  • Description (Captioning or Description Generation): Here, the API has generated a detailed description of the product, including the type of shirt, its features, and the colors and patterns it has. This can be useful for creating detailed product listings or search results for the product. This is a typical use-case for the Captioning or Description Generation task.
  • Category (Classification or Tagging): The API has also identified the category of the product as “Apparel”. This can be useful for categorizing products in a catalog or search results. This is a typical use-case for the Classification or Tagging task.
  • Season (Classification or Tagging): The API has identified the season the product is intended for as “Summer”. This can be useful for filtering products by season or for creating seasonal collections. This is again the Classification or Tagging task, with the additional feature that the Literal type restricts the possible values to a predefined set.
  • Gender (Classification or Tagging): The API has identified the gender the product is intended for as “Men”. This can be useful for filtering products by gender or audience. This is similar to the Season task, but with a different set of possible values.
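The field-level checks described above can be sketched as a lightweight validation pass. This is a stdlib-only stand-in for the Pydantic validation the API performs against ProductCatalog, mirroring the same Literal constraints:

```python
# Minimal stand-in for validating API output against the ProductCatalog
# schema, using only the standard library (the API itself validates with
# Pydantic; this sketch just mirrors the same constraints).
SEASONS = {"Fall", "Spring", "Summer", "Winter"}
GENDERS = {"Men", "Women", "Kids"}

def validate_catalog_entry(entry: dict) -> dict:
    # Every field in the schema is required.
    for field in ("description", "category", "season", "gender"):
        if field not in entry:
            raise ValueError(f"missing field: {field}")
    # Literal-typed fields must come from the predefined sets.
    if entry["season"] not in SEASONS:
        raise ValueError(f"invalid season: {entry['season']}")
    if entry["gender"] not in GENDERS:
        raise ValueError(f"invalid gender: {entry['gender']}")
    return entry

sample = {
    "description": "A casual, button-up plaid shirt with short sleeves in a "
                   "light fabric. The shirt features a combination of blue "
                   "and white colors in a checkered pattern.",
    "category": "Apparel",
    "season": "Summer",
    "gender": "Men",
}
validated = validate_catalog_entry(sample)
```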

Cataloging larger image catalogs

Once you have validated the output for a single image, you can scale this process to catalog larger volumes of images. You can use the same API call to generate structured data for multiple images, and then use this structured data to create a product catalog that can be searched, filtered, and analyzed in various ways. Better yet, you can ingest the JSON directly into JSON-compatible databases like MongoDB or Elasticsearch, or even traditional SQL databases, unlocking a wide range of semantic image-search and querying possibilities for your cataloging needs.
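At catalog scale, the per-image JSON rows can be collected and queried directly. Here is a minimal sketch, assuming the per-image predictions are already available as dicts; the sample rows are hypothetical, and in practice an in-memory list would be replaced by a store like MongoDB or Elasticsearch:

```python
# Sketch: filtering a batch of per-image predictions as if they were rows
# in a catalog. The rows below are hypothetical examples; in practice they
# would be ingested into a JSON-compatible database.
catalog = [
    {"description": "Navy plaid short-sleeve shirt.", "category": "Apparel",
     "season": "Summer", "gender": "Men"},
    {"description": "Wool knit scarf.", "category": "Accessories",
     "season": "Winter", "gender": "Women"},
    {"description": "Canvas slip-on sneakers.", "category": "Footwear",
     "season": "Spring", "gender": "Kids"},
]

def filter_catalog(rows, **criteria):
    """Return rows whose fields match every key/value pair in criteria."""
    return [r for r in rows if all(r.get(k) == v for k, v in criteria.items())]

summer_menswear = filter_catalog(catalog, season="Summer", gender="Men")
```

The same structured fields that VLM-1 emits become the filter keys, which is what makes the catalog searchable without any extra annotation work.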

VLM-1 predictions for the fashion dataset.

Fine-tuning for custom cataloging

For enterprise use-cases that require custom-tailored cataloging and improved accuracy, you can use our fine-tuning guides to adapt the model to your performance and scalability needs. This can include fine-tuning on your own data, customizing the model architecture, or adding new capabilities. Fine-tuning improves accuracy on your specific cataloging tasks, and lightweight fine-tuned models optimized for your use-case can also help you scale efficiently to larger volumes of images.

This feature is currently only available for our enterprise-tier customers. If you are interested in using this feature, please contact us.

Get Started with our Image -> JSON API

Head over to our Image -> JSON guide to start building your own image-processing pipeline with VLM-1. Sign up for access to our API here.