VLM Run’s visual grounding capability connects extracted data to precise locations in your visual content. This feature maps structured data to specific coordinates within documents, images, or videos, providing spatial context for extracted information. With visual grounding, you can pinpoint exactly where data originated - for example, identifying which table in a document corresponds to a particular JSON object in your results.

Example of a driver's license with visual grounding enabled.

How Visual Grounding Works

When enabled, vlm-1 provides location information for each of the Pydantic fields in the JSON response, under the corresponding *_metadata key. The bounding box coordinates are represented in a normalized xywh format (see Bounding Box below).

Using Visual Grounding

You can enable visual grounding by setting the grounding parameter to True in your GenerationConfig:

from pathlib import Path
from vlmrun.client import VLMRun
from vlmrun.client.types import GenerationConfig, PredictionResponse

client = VLMRun(api_key="...")
prediction: PredictionResponse = client.image.generate(
    file=Path("path/to/license.jpg"),
    domain="document.us-drivers-license",
    config=GenerationConfig(grounding=True),
)
print(prediction.response.model_dump_json(indent=2))

Understanding the Output

For the purpose of this example, we have simplified the JSON response to include only the license_number and license_expiration_date fields and their metadata:

{
  "license_number": "1234567",
  "license_number_metadata": {
    "confidence": "hi",
    "bboxes": [
      {
        "xywh": [0.2, 0.15, 0.15, 0.05],
        "page": 0
      }
    ]
  },
  "license_expiration_date": "2014-01-05",
  "license_expiration_date_metadata": {
    "confidence": "hi",
    "bboxes": [
      {
        "xywh": [0.8, 0.2, 0.15, 0.05],
        "page": 0
      }
    ]
  },
  ...
}
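
If you prefer to work with these fields programmatically rather than as raw JSON, you can dump the response to a plain dictionary and read each value alongside its metadata. The snippet below is a minimal sketch that assumes the response shape mirrors the JSON above:

# Dump the Pydantic response to a plain dict (field names mirror the JSON above)
response = prediction.response.model_dump()

print(response["license_number"])  # "1234567"

meta = response["license_number_metadata"]
print(meta["confidence"])  # "hi"
for box in meta["bboxes"]:
    x, y, w, h = box["xywh"]
    print(f"page {box['page']}: x={x:.2f} y={y:.2f} w={w:.2f} h={h:.2f}")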

1. Confidence Levels: *_metadata.confidence

Each grounded field includes a confidence level, which can be one of:

  • hi: High confidence in the extraction accuracy
  • med: Medium confidence, suggesting some uncertainty
  • low: Low confidence, indicating potential inaccuracy

These confidence values help you assess the reliability of the extracted data and decide whether manual review might be needed. Only a single confidence value is returned for each field (unlike the bounding boxes below).
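
For example, you might route any field that is not high-confidence to a manual review queue. The helper below is a hypothetical sketch that assumes the response has been dumped to a dict as in the earlier snippet:

def fields_needing_review(response: dict) -> list[str]:
    """Return the names of fields whose confidence is 'med' or 'low'."""
    flagged = []
    for key, value in response.items():
        if key.endswith("_metadata") and value.get("confidence") in ("med", "low"):
            flagged.append(key.removesuffix("_metadata"))
    return flagged

print(fields_needing_review(response))  # [] when every field is "hi"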

2. Bounding Box: *_metadata.bboxes

The bounding box coordinates are represented in a normalized xywh format, where each value is between 0 and 1, representing:

  • x: horizontal position of the top-left corner (0 = left edge, 1 = right edge)
  • y: vertical position of the top-left corner (0 = top edge, 1 = bottom edge)
  • w: width of the box (0 = no width, 1 = full image/document width)
  • h: height of the box (0 = no height, 1 = full image/document height)
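
To render a box, multiply the normalized coordinates by the image dimensions. Below is a minimal sketch using Pillow (our choice for illustration; any imaging library works) that highlights the license_number box from the earlier response:

from PIL import Image, ImageDraw

image = Image.open("path/to/license.jpg")
W, H = image.size

# Normalized xywh taken from the response above
x, y, w, h = 0.2, 0.15, 0.15, 0.05

# Convert to absolute pixel coordinates (left, top, right, bottom)
box = (x * W, y * H, (x + w) * W, (y + h) * H)

draw = ImageDraw.Draw(image)
draw.rectangle(box, outline="red", width=3)
image.save("license_grounded.jpg")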

Visual Grounding on a Document

In the earlier example, we showed how to ground data fields in a single image. When working with documents, however, you may want to ground fields that appear in multiple locations across multiple pages. VLM Run’s visual grounding capability extends to multi-page documents, allowing you to extract and localize data across entire document sets.

When processing multi-page documents:

  • Instead of a single bounding box, each field's *_metadata.bboxes list may contain multiple bounding boxes, each carrying its page number under *_metadata.bboxes[].page.
  • Page numbers are included in the metadata for each grounded element, so you can navigate to the correct page when users interact with the data or want to cite it.

Here’s an example of visual grounding in a multi-page document:

{
  "invoice_number": "INV-2023-0042",
  "invoice_number_metadata": {
    "confidence": "hi",
    "bboxes": [
      {
        "xywh": [0.7, 0.1, 0.2, 0.05],
        "page": 1
      },
      {
        "xywh": [0.7, 0.1, 0.2, 0.05],
        "page": 2 // invoice_number can typically be found on each page of the document
      },
      ...
    ]
  },
  "total_amount": "1000",
  "total_amount_metadata": {
    "confidence": "hi",
    "bboxes": [
      {
        "xywh": [0.2, 0.2, 0.3, 0.05],
        "page": 0 // `total_amount` can typically be found on the first and last page of the document
      },
      {
        "xywh": [0.6, 0.8, 0.3, 0.05],
        "page": 6 // `total_amount` can typically be found on the first and last page of the document
      }
    ]
  }
}
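
To drive page navigation or highlighting in a viewer, you can index the grounded boxes by page. The helper below is a sketch under the same dict assumption as the earlier snippets:

from collections import defaultdict

def boxes_by_page(response: dict) -> dict[int, list]:
    """Map page number -> [(field_name, xywh), ...] for all grounded fields."""
    pages = defaultdict(list)
    for key, value in response.items():
        if not key.endswith("_metadata"):
            continue
        field = key.removesuffix("_metadata")
        for box in value.get("bboxes", []):
            pages[box["page"]].append((field, box["xywh"]))
    return dict(pages)

# e.g. {1: [("invoice_number", [0.7, 0.1, 0.2, 0.05])], 0: [("total_amount", ...)], ...}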

Use Cases

Visual grounding enables several powerful applications:

  1. Document verification: Validate the location of key fields in identity documents (verification checks for KYC, AML, etc.)
  2. Data extraction audit: Verify the source location of extracted information, typically for high-stakes or sensitive applications (finance, healthcare, etc.)
  3. Interactive annotation: Build interfaces that highlight document regions as users interact with extracted data (e.g. highlight the bounding box around the invoice_number for back-office operations)
  4. Error correction: Easily identify and fix extraction errors by referring to the original location of the data
  5. Document comparison: Compare the location of similar elements across different document versions

By combining structured data extraction with spatial localization, visual grounding provides a comprehensive solution for document processing tasks that require both the “what” and the “where” of information.

For a hands-on tutorial, check out our Visual Grounding Notebook.

Try our Document -> JSON API today

Head over to our Document -> JSON API to start building your own document processing pipeline with VLM Run. Sign up for access on our platform.