Ground extracted data with location (bounding box) coordinates and confidence scores.
Navigate over to the driver’s license playground to see visual grounding in action.
VLM Run’s visual grounding capability connects extracted data to precise locations in your visual content. This feature maps structured data to specific coordinates within documents, images, or videos, providing spatial context for extracted information. With visual grounding, you can pinpoint exactly where data originated - for example, identifying which table in a document corresponds to a particular JSON object in your results.
Example of a driver's license with visual grounding enabled.
When enabled, vlm-1 provides the location information for each of the Pydantic fields in the JSON response under the metadata key. The bounding box coordinates are represented in a normalized xywh format (see Bounding Box below).
You can enable visual grounding by simply setting the grounding parameter to True in your GenerationConfig:
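A minimal sketch, assuming the vlmrun Python SDK: the client constructor, the document.generate call, the url/domain parameters, and the import path for GenerationConfig are assumptions and may differ in your SDK version; only grounding=True on GenerationConfig is taken from this guide.

```python
from vlmrun.client import VLMRun
from vlmrun.client.types import GenerationConfig  # import path is an assumption

# Initialize the client; reads VLMRUN_API_KEY from the environment, or pass api_key=...
client = VLMRun()

# Enable visual grounding by setting grounding=True in the GenerationConfig.
response = client.document.generate(
    url="https://example.com/sample-invoice.pdf",  # placeholder document URL
    domain="document.invoice",                     # example domain; use the one matching your schema
    config=GenerationConfig(grounding=True),
)

print(response.response)  # structured JSON, with per-field grounding under the metadata key
```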
For the purpose of this example, we have simplified the JSON response and metadata to only include the customer_name and invoice_date fields:
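The grounded output might look roughly like the sketch below. The values are placeholders, and the exact nesting (whether the *_metadata entries live under a top-level metadata key) and the shape of each bounding-box entry are assumptions based on the field descriptions in this guide:

```python
# Illustrative grounded response for the two simplified fields (placeholder values).
# Each extracted field has a companion <field>_metadata entry carrying a confidence
# level and one or more normalized xywh bounding boxes.
grounded_response = {
    "customer_name": "Acme Corp",
    "invoice_date": "2024-01-15",
    "metadata": {
        "customer_name_metadata": {
            "confidence": "hi",
            "bboxes": [{"page": 1, "xywh": [0.12, 0.08, 0.25, 0.03]}],
        },
        "invoice_date_metadata": {
            "confidence": "med",
            "bboxes": [{"page": 1, "xywh": [0.68, 0.08, 0.18, 0.03]}],
        },
    },
}
```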
*_metadata.confidence
Each grounded field includes a confidence score, which can be one of:
- hi: High confidence in the extraction accuracy
- med: Medium confidence, suggesting some uncertainty
- low: Low confidence, indicating potential inaccuracy
These confidence values help you assess the reliability of the extracted data and decide whether manual review might be needed. Only a single confidence value is returned for each field (unlike the bounding boxes below).
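One practical use of these scores is to route anything below high confidence to manual review. A minimal sketch, reusing the illustrative grounded_response dict above (the metadata nesting and key names are assumptions):

```python
# Collect fields whose extraction confidence is below "hi" so a human can review them.
NEEDS_REVIEW = {"med", "low"}

def fields_needing_review(grounded_response: dict) -> list[str]:
    flagged = []
    for key, meta in grounded_response.get("metadata", {}).items():
        if meta.get("confidence") in NEEDS_REVIEW:
            # Strip the "_metadata" suffix to recover the original field name.
            flagged.append(key.removesuffix("_metadata"))
    return flagged

print(fields_needing_review(grounded_response))  # e.g. ["invoice_date"]
```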
*_metadata.bboxes
The bounding box coordinates are represented in a normalized xywh format, where each value is between 0 and 1, representing:
- x: horizontal position of the top-left corner (0 = left edge, 1 = right edge)
- y: vertical position of the top-left corner (0 = top edge, 1 = bottom edge)
- w: width of the box (0 = no width, 1 = full image/document width)
- h: height of the box (0 = no height, 1 = full image/document height)
The earlier example showed how to ground data fields in a single image. When working with documents, however, a field may appear in multiple locations across multiple pages. VLM Run’s visual grounding capability extends to multi-page documents, allowing you to extract and localize data across entire document sets.
When processing multi-page documents, _metadata.bboxes may contain multiple bounding boxes, each carrying its page number under _metadata.bboxes[].page.
Here’s an example of visual grounding in a multi-page document:
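As a concrete illustration, the sketch below converts each normalized xywh box to pixel coordinates and groups the results by page. It reuses the illustrative grounded_response from earlier; the page key, the box shape, and the per-page pixel sizes are all assumptions:

```python
# Convert normalized xywh boxes to pixel coordinates and group them by page.
from collections import defaultdict

PAGE_SIZES = {1: (1700, 2200), 2: (1700, 2200)}  # hypothetical (width, height) in pixels per page

def boxes_by_page(field_metadata: dict) -> dict[int, list[tuple[int, int, int, int]]]:
    grouped = defaultdict(list)
    for bbox in field_metadata.get("bboxes", []):
        page = bbox.get("page", 1)      # single-image results may omit the page number
        x, y, w, h = bbox["xywh"]       # normalized values in [0, 1]
        page_w, page_h = PAGE_SIZES[page]
        grouped[page].append((
            round(x * page_w),          # pixel x of the top-left corner
            round(y * page_h),          # pixel y of the top-left corner
            round(w * page_w),          # pixel width
            round(h * page_h),          # pixel height
        ))
    return dict(grouped)

print(boxes_by_page(grounded_response["metadata"]["customer_name_metadata"]))
```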
Visual grounding enables several powerful applications, such as verifying extracted fields against their source locations (for example, invoice_number for back-office operations). By combining structured data extraction with spatial localization, visual grounding provides a comprehensive solution for document processing tasks that require both the “what” and the “where” of information.
Head over to our Document -> JSON guide to start building your own document processing pipeline with VLM Run. Sign up for access on our platform.