In some documents, users want to tie extracted content back to its location on the page by localizing the corresponding visual elements. For example, in a technical document, we may want to know exactly which table on the page a specific JSON table object corresponds to. This process is known as visual grounding.

VLM-1 can extract visual groundings from documents through a simple interface. The extracted groundings can be used to understand the context of the visual elements in a document and to link them to the corresponding textual content. Let’s take a look at an example that visually grounds the tables in a hardware spec sheet.

Example showcasing visually grounding tables.

The corresponding JSON output contains the bounding box coordinates for each table (T1 and T2) in the document, stored under the bbox field of each entry in tables. These coordinates can be used to link each visual element to its extracted description, caption, and markdown content.

{
  "id": "...",
  "created_at": "...",
  "completed_at": "...",
  {
  "charts": [],
  "tables": [
    {
      "description": "This table details the specifications for the current input loop powered, including parameters like input resolution, input range, programmable current limit, HART mode current limit, accuracy in terms of TUE, INL, offset error, gain error, and other input specifications like DC PSRR, input impedance, and headroom. Each parameter is accompanied by its minimum, typical, and maximum values, their units, and specific test conditions or comments.",
      "title": null,
      "caption": "Table 5",
      "markdown": "..."
      "annotation": "T0",
      "bbox": {
        "xywh": [
          0.07529411764705882,
          0.18545454545454546,
          0.8735294117647059,
          0.4136363636363637
        ]
      }
    },
    {
      "description": "This table provides the specifications for resistance measurement, including input range, bias voltage, pull-up resistor, and accuracy for different measurement ranges. Each parameter is described with its minimum, typical, and maximum values, and its units, along with test conditions or comments.",
      "title": null,
      "caption": "Table 6",
      "markdown": "...",
      "annotation": "T0",
      "bbox": {
        "xywh": [
          0.07470588235294118,
          0.7281818181818182,
          0.8688235294117648,
          0.21272727272727276
        ]
      }
    }
  ]
}
}
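
To make the bounding boxes above easier to use, here is a minimal Python sketch that converts each table's bbox into pixel coordinates. It rests on assumptions that the output above does not confirm: that xywh means [x, y, width, height] with the origin at the top-left corner of the page, and that the values are normalized to the page width and height (they all fall between 0 and 1). The page size of 1700 x 2200 pixels and the helper names xywh_to_pixels and table_boxes are hypothetical and are not part of the VLM-1 API.

```python
from typing import Dict, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def xywh_to_pixels(xywh: Sequence[float], page_width: int, page_height: int) -> Box:
    """Convert an [x, y, width, height] box to pixel corner coordinates.

    Assumes the values are fractions of the page size with a top-left origin.
    """
    x, y, w, h = xywh
    left = round(x * page_width)
    top = round(y * page_height)
    right = round((x + w) * page_width)
    bottom = round((y + h) * page_height)
    return left, top, right, bottom


def table_boxes(tables: List[dict], page_width: int, page_height: int) -> Dict[str, Box]:
    """Map each table's caption to its pixel-space bounding box."""
    return {
        t["caption"]: xywh_to_pixels(t["bbox"]["xywh"], page_width, page_height)
        for t in tables
    }


# The "tables" entries from the output above, reduced to the fields used here.
tables = [
    {
        "caption": "Table 5",
        "bbox": {"xywh": [0.07529411764705882, 0.18545454545454546,
                          0.8735294117647059, 0.4136363636363637]},
    },
    {
        "caption": "Table 6",
        "bbox": {"xywh": [0.07470588235294118, 0.7281818181818182,
                          0.8688235294117648, 0.21272727272727276]},
    },
]

# Hypothetical rendered page size in pixels; use the size of your own page image.
PAGE_WIDTH, PAGE_HEIGHT = 1700, 2200

boxes = table_boxes(tables, PAGE_WIDTH, PAGE_HEIGHT)
for caption, box in boxes.items():
    print(caption, box)
```

The (left, top, right, bottom) form is what most image libraries expect for cropping or drawing, so each box can be used to cut the table out of the rendered page and display it next to its extracted markdown.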

Get Started with our Document -> JSON API

Head over to our Document -> JSON API to start building your own document processing pipeline with VLM-1. Sign up for access to our API here.