Visual Grounding
Ground extracted data with location (bounding box) coordinates and confidence scores.
Visual Grounding Demo
Navigate over to the driver’s license playground to see the visual grounding in action.
VLM Run’s visual grounding capability connects extracted data to precise locations in your visual content. This feature maps structured data to specific coordinates within documents, images, or videos, providing spatial context for extracted information. With visual grounding, you can pinpoint exactly where data originated - for example, identifying which table in a document corresponds to a particular JSON object in your results.
Example of a driver's license with visual grounding enabled.
How Visual Grounding Works
When enabled, `vlm-1` provides location information for each of the Pydantic fields in the JSON response, under the `metadata` key. The bounding box coordinates are represented in a normalized `xywh` format (see Bounding Box below).
Using Visual Grounding
You can enable visual grounding by simply setting the `grounding` parameter to `True` in your `GenerationConfig`:
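As a minimal sketch, the request might look like the following. Only the `grounding: true` flag inside `GenerationConfig` comes from this page; the client setup, `file_id`, and `domain` values are illustrative assumptions, not the exact SDK call.

```python
# Sketch: enabling visual grounding via the generation config.
# Only the `grounding` flag is documented above; the surrounding
# payload fields (file_id, domain) are hypothetical placeholders.
config = {
    "grounding": True,  # return bounding boxes + confidence metadata
}

request_payload = {
    "file_id": "file_abc123",      # hypothetical uploaded document id
    "domain": "document.invoice",  # hypothetical extraction domain
    "config": config,
}

print(request_payload["config"]["grounding"])  # True
```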
Understanding the Output
For the purpose of this example, we have simplified the JSON response and metadata to only include the `customer_name` and `invoice_date` fields:
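A response of roughly this shape is what you can expect. The values and coordinates below are made up, and the exact key names inside each bounding box are assumptions; only the overall structure (a `*_metadata` entry carrying `confidence` and `bboxes` in normalized `xywh`) follows the conventions described on this page.

```python
import json

# Illustrative simplified response. Values/coordinates are invented;
# the shape (per-field *_metadata with `confidence` and `bboxes`)
# follows the conventions described in this page.
response = {
    "customer_name": "Acme Corp",
    "invoice_date": "2024-03-15",
    "metadata": {
        "customer_name_metadata": {
            "confidence": "hi",
            "bboxes": [{"xywh": [0.12, 0.08, 0.25, 0.03]}],  # key name assumed
        },
        "invoice_date_metadata": {
            "confidence": "med",
            "bboxes": [{"xywh": [0.70, 0.08, 0.15, 0.03]}],
        },
    },
}

print(json.dumps(response, indent=2))
```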
1. Confidence Levels: `*_metadata.confidence`
Each grounded field includes a confidence score, which can be one of:
- `hi`: High confidence in the extraction accuracy
- `med`: Medium confidence, suggesting some uncertainty
- `low`: Low confidence, indicating potential inaccuracy
These confidence values help you assess the reliability of the extracted data and decide whether manual review might be needed. Only a single confidence value is returned for each field (unlike the bounding boxes below).
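One common pattern is to triage fields by confidence and flag anything below `hi` for manual review. This is a sketch over the documented `confidence` values; the field names and review threshold are illustrative choices, not part of the API.

```python
# Sketch: flag grounded fields for manual review based on confidence.
# The "hi" / "med" / "low" values are documented above; the metadata
# contents and the review threshold are illustrative.
metadata = {
    "customer_name_metadata": {"confidence": "hi"},
    "invoice_date_metadata": {"confidence": "low"},
    "total_amount_metadata": {"confidence": "med"},
}

NEEDS_REVIEW = {"med", "low"}  # route anything uncertain to a human

flagged = sorted(
    name.removesuffix("_metadata")
    for name, meta in metadata.items()
    if meta["confidence"] in NEEDS_REVIEW
)

print(flagged)  # ['invoice_date', 'total_amount']
```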
2. Bounding Box: `*_metadata.bboxes`
The bounding box coordinates are represented in a normalized `xywh` format, where each value is between 0 and 1, representing:
- `x`: horizontal position of the top-left corner (0 = left edge, 1 = right edge)
- `y`: vertical position of the top-left corner (0 = top edge, 1 = bottom edge)
- `w`: width of the box (0 = no width, 1 = full image/document width)
- `h`: height of the box (0 = no height, 1 = full image/document height)
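To draw these boxes on a rendered image or page, you scale the normalized values by the image dimensions. This helper is pure arithmetic derived from the definitions above:

```python
# Convert a normalized (x, y, w, h) box, each value in [0, 1], into
# pixel coordinates for an image of the given size.
def to_pixels(xywh, image_width, image_height):
    x, y, w, h = xywh
    left = round(x * image_width)
    top = round(y * image_height)
    width = round(w * image_width)
    height = round(h * image_height)
    return left, top, width, height

# A box covering the top-right quadrant of a 1000x800 image:
print(to_pixels((0.5, 0.0, 0.5, 0.5), 1000, 800))  # (500, 0, 500, 400)
```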
Visual Grounding on a Document
In the earlier example, we showed how to ground data fields in a single image. When working with documents, however, you may want to ground fields that appear in multiple locations across multiple pages. VLM Run’s visual grounding capability extends to multi-page documents, allowing you to extract and localize data across entire document sets.
When processing multi-page documents:
- Instead of a single bounding box, each field’s `*_metadata.bboxes` may contain multiple bounding boxes, with the page number stored under `*_metadata.bboxes[].page`.
- Page numbers are included in the metadata for each grounded element, so you can navigate to the correct page when users interact with the data or want to cite it.
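The multi-page behavior described above can be sketched as a small grouping step: collect a field's boxes by page so a viewer can jump to the right one. The `bboxes[].page` key comes from this page; the `xywh` key name and the 1-based page numbers are assumptions for illustration.

```python
from collections import defaultdict

# Sketch: group one field's bounding boxes by page number so a UI can
# navigate to each page where the field was found. Box contents are
# illustrative; `page` is the documented per-bbox metadata key.
bboxes = [
    {"page": 1, "xywh": [0.10, 0.20, 0.30, 0.04]},
    {"page": 3, "xywh": [0.15, 0.65, 0.25, 0.04]},
    {"page": 1, "xywh": [0.10, 0.80, 0.30, 0.04]},
]

by_page = defaultdict(list)
for box in bboxes:
    by_page[box["page"]].append(box["xywh"])

print(sorted(by_page))  # pages where this field appears: [1, 3]
```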
Here’s an example of visual grounding in a multi-page document:
Use Cases
Visual grounding enables several powerful applications:
- Document verification: Validate the location of key fields in identity documents (verification checks for KYC, AML, etc.)
- Data extraction audit: Verify the source location of extracted information, typically for high-stakes or sensitive applications (finance, healthcare, etc.)
- Interactive annotation: Build interfaces that highlight document regions as users interact with extracted data (e.g. highlight the bounding box around the `invoice_number` for back-office operations)
- Error correction: Easily identify and fix extraction errors by referring to the original location of the data
- Document comparison: Compare the location of similar elements across different document versions
By combining structured data extraction with spatial localization, visual grounding provides a comprehensive solution for document processing tasks that require both the “what” and the “where” of information.
Try our Document -> JSON API today
Head over to our Document -> JSON API to start building your own document processing pipeline with VLM Run. Sign up for access on our platform.