Visual Grounding Demo
Navigate over to the driver’s license playground to see the visual grounding in action.

Example of a driver's license with visual grounding enabled.
How Visual Grounding Works
When enabled,vlm-1
provides the location information for each of the Pydantic fields in the JSON response, under the metadata
key. The bounding box coordinates are represented in a normalized xywh
format (see Bounding Box below)
Using Visual Grounding
You can enable visual grounding in by simply setting thegrounding
parameter to True
in your GenerationConfig
:
Understanding the Output
For the purpose of this example, we have simplified the JSON response and metadata to only include thecustomer_name
and invoice_date
fields:
1. Confidence Levels: *_metadata.confidence
Each grounded field includes a confidence score, which can be one of:
hi
: High confidence in the extraction accuracymed
: Medium confidence, suggesting some uncertaintylow
: Low confidence, indicating potential inaccuracy
2. Bounding Box: *_metadata.bboxes
The bounding box coordinates are represented in a normalized xywh
format, where each value is between 0 and 1, representing:
x
: horizontal position of the top-left corner (0 = left edge, 1 = right edge)y
: vertical position of the top-left corner (0 = top edge, 1 = bottom edge)w
: width of the box (0 = no width, 1 = full image/document width)h
: height of the box (0 = no height, 1 = full image/document height)
Visual Grounding on a Document
In the earlier example, we showed you how to ground the data fields in a single image. However, when working with documents, you may want to ground the data fields that may appear on multiple locations, across multiple pages. VLM Run’s visual grounding capability extends to multi-page documents, allowing you to extract and localize data across entire document sets. When processing multi-page documents:- Instead of a single bounding box, each field
_metadata.bbox
may have multiple bounding boxes, along with the page number metadata under_metadata.bboxes[].page
. - Page numbers are included in the metadata for each grounded element. This allows you to navigate to the correct page when users interact with the data or are looking to cite the data.
Use Cases
Visual grounding enables several powerful applications:- Document verification: Validate the location of key fields in identity documents (verification checks for KYC, AML, etc.)
- Data extraction audit: Verify the source location of extracted information, typically for high-stakes or sensitive applications (finance, healthcare, etc.)
- Interactive annotation: Build interfaces that highlight document regions as users interact with extracted data (e.g. highlight the bounding box around the
invoice_number
for back-office operations) - Error correction: Easily identify and fix extraction errors by referring to the original location of the data.
- Document comparison: Compare the location of similar elements across different document versions
For a hands-on tutorial, check out our Visual Grounding Notebook
.