Document Form

Driver’s License

TV News Broadcast Text

Visual Grounding Example showing text-to-visual element mapping with highlighted connections.
Usage Example
For visual grounding, we highly recommend using the Structured Outputs API to get the text-visual mappings and spatial relationships in a structured and validated data format.
The following examples can map text elements to their visual locations, detect spatial relationships, and identify cross-references in documents. The response schema includes bounding boxes, confidence scores, and relationship types.
FAQ
What types of text-visual mappings are supported?
What types of text-visual mappings are supported?
- Form Fields: Connect labels with input fields, checkboxes, and buttons
- Data Fields: Map data labels with their corresponding values
- Interactive Elements: Link text instructions with clickable elements
- Validation Rules: Connect validation text with form fields
- Cross-References: Map text mentions with figures, tables, and sections
What format do the bounding boxes come in?
What format do the bounding boxes come in?
The bounding boxes come in the format of
xywh
, where x
and y
are the top-left corner coordinates, and w
and h
are the width and height of the bounding box. All values are in pixels relative to the document image.What spatial relationships can be detected?
What spatial relationships can be detected?
- Label-Field Pairs: Identify which labels belong to which fields
- Hierarchical Structure: Understand parent-child relationships
- Proximity Analysis: Determine related elements based on spatial proximity
- Alignment Patterns: Detect aligned elements and groups
What is the confidence score?
What is the confidence score?
The confidence score is a value between 0 and 1 that indicates the confidence of the text-visual mapping. Higher scores indicate more reliable connections.
Can it process multi-page documents?
Can it process multi-page documents?
Yes, visual grounding can process multi-page documents. Each page is analyzed separately, and the results include page-specific mappings and relationships.