Ground extracted data with location (bounding box) coordinates and confidence scores.
Example of a driver's license with visual grounding enabled.
vlm-1 provides the location information for each of the Pydantic fields in the JSON response, under the metadata key. The bounding box coordinates are represented in a normalized xywh format (see Bounding Box below).
To enable visual grounding, set the grounding parameter to True in your GenerationConfig:
The response includes metadata for the customer_name and invoice_date fields:
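As a rough illustration of how such a response might be consumed, the sketch below builds a hypothetical payload and flags fields whose confidence is not high. Reading *_metadata as a sibling key named <field>_metadata is an assumption, as are all the values, the xywh key inside each box, and the helper name fields_needing_review; none of these come from the vlm-1 API reference.

```python
from typing import Any

# Hypothetical response shaped after the keys named in this section.
# Every value, and the "xywh" key inside each box, is an illustrative assumption.
response: dict[str, Any] = {
    "customer_name": "Jane Doe",
    "customer_name_metadata": {
        "confidence": "hi",
        "bboxes": [{"xywh": [0.10, 0.12, 0.30, 0.04], "page": 0}],
    },
    "invoice_date": "2024-01-15",
    "invoice_date_metadata": {
        "confidence": "low",
        "bboxes": [{"xywh": [0.62, 0.12, 0.20, 0.04], "page": 0}],
    },
}


def fields_needing_review(resp: dict[str, Any], accepted=("hi",)) -> list[str]:
    """Return the names of fields whose confidence level is not in `accepted`."""
    suffix = "_metadata"
    flagged = []
    for key, meta in resp.items():
        if not key.endswith(suffix):
            continue  # skip the extracted values themselves
        if meta.get("confidence") not in accepted:
            flagged.append(key[: -len(suffix)])
    return flagged


print(fields_needing_review(response))  # ['invoice_date']
```

Passing a wider accepted tuple, for example ("hi", "med"), relaxes the threshold when medium-confidence extractions are good enough for the task.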
*_metadata.confidence
One of three levels:
hi: High confidence in the extraction accuracy
med: Medium confidence, suggesting some uncertainty
low: Low confidence, indicating potential inaccuracy
*_metadata.bboxes
Bounding boxes use a normalized xywh format, where each value is between 0 and 1, representing:
x: horizontal position of the top-left corner (0 = left edge, 1 = right edge)
y: vertical position of the top-left corner (0 = top edge, 1 = bottom edge)
w: width of the box (0 = no width, 1 = full image/document width)
h: height of the box (0 = no height, 1 = full image/document height)
_metadata.bbox may have multiple bounding boxes, along with the page number metadata under _metadata.bboxes[].page.
invoice_number for back-office operations)
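To draw or crop the normalized xywh boxes described above, they usually need to be scaled to pixel coordinates first. The following is a minimal sketch of that arithmetic, assuming each box is a dict with an xywh list and a page number; the function name xywh_to_pixels, the dict layout, and the 1000 x 800 page size are illustrative assumptions rather than part of the API.

```python
def xywh_to_pixels(xywh, image_width, image_height):
    """Convert a normalized (x, y, w, h) box to pixel (left, top, right, bottom)."""
    x, y, w, h = xywh
    left = round(x * image_width)
    top = round(y * image_height)
    right = round((x + w) * image_width)
    bottom = round((y + h) * image_height)
    return left, top, right, bottom


# Hypothetical boxes for one field that spans two pages.
boxes = [
    {"xywh": [0.125, 0.25, 0.5, 0.25], "page": 0},
    {"xywh": [0.0, 0.0, 1.0, 1.0], "page": 1},
]

for box in boxes:
    # Each page is assumed to be rendered at 1000 x 800 pixels.
    print(box["page"], xywh_to_pixels(box["xywh"], 1000, 800))
# 0 (125, 200, 625, 400)
# 1 (0, 0, 1000, 800)
```

Because the coordinates are normalized, the same box can be rescaled to any rendering resolution of the page without re-running extraction.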