Connect text elements with their visual locations in documents for precise content understanding. Perfect for interactive document analysis, content verification, automated form filling, and document comparison workflows.
Figures: visual grounding examples on a document form, a driver's license, and a TV news broadcast, each showing text-to-visual element mapping with highlighted connections.

Usage Example

For visual grounding, we highly recommend using the Structured Outputs API so that text-visual mappings and spatial relationships come back in a structured, validated format.
The following example maps text elements to their visual locations in a TV news broadcast frame. The response schema pairs each text element with its bounding box, and can be extended with confidence scores and relationship types (see the FAQ below).
import openai
from pydantic import BaseModel, Field

class GroundingWithText(BaseModel):
    content: str = Field(..., description="The text content")
    xywh: tuple[float, float, float, float] = Field(..., description="Bounding box as normalized (x, y, width, height)")

class GroundingResponse(BaseModel):
    elements: list[GroundingWithText] = Field(..., description="Text to visual mappings")

# Initialize the client
client = openai.OpenAI(
    base_url="https://agent.vlm.run/v1/openai",
    api_key="<VLMRUN_API_KEY>"
)

# Perform visual grounding
response = client.chat.completions.create(
    model="vlm-agent-1",
    messages=[
        {
          "role": "user",
          "content": [
            {"type": "text", "text": "Localize all the speaker names in the TV news broadcast text and visualize them on the image. Only provide one bounding box for each speaker name."},
            {"type": "image_url", "image_url": {"url": "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/media.tv-news/finance_bb_3_speakers.jpg", "detail": "auto"}}
          ]
        }
    ],
    response_format={"type": "json_schema", "schema": GroundingResponse.model_json_schema()},
)

# Print the response
print(response.choices[0].message.content)
>>> {"elements": [{"content": "HAIDI STROUD-WATTS", "xywh": [0.428, 0.217, 0.128, 0.286]}, ...]}

# Validate the response
print(GroundingResponse.model_validate_json(response.choices[0].message.content))
>>> GroundingResponse(elements=[GroundingWithText(content="HAIDI STROUD-WATTS", xywh=(0.428, 0.217, 0.128, 0.286)), ...])
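
To sanity-check the mappings, you can draw the returned boxes back onto the source image. A minimal sketch using Pillow and requests, assuming the xywh values are normalized to the image dimensions as in the output above:

import requests
from io import BytesIO
from PIL import Image, ImageDraw

# Fetch the same broadcast frame used in the request above
url = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/media.tv-news/finance_bb_3_speakers.jpg"
image = Image.open(BytesIO(requests.get(url).content))
draw = ImageDraw.Draw(image)

grounding = GroundingResponse.model_validate_json(response.choices[0].message.content)
for element in grounding.elements:
    x, y, w, h = element.xywh
    # Scale normalized coordinates up to pixel space before drawing
    left, top = x * image.width, y * image.height
    right, bottom = left + w * image.width, top + h * image.height
    draw.rectangle([left, top, right, bottom], outline="red", width=3)
    draw.text((left, max(top - 12, 0)), element.content, fill="red")

image.save("grounded.jpg")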

FAQ

What types of text-visual connections can be detected?
  • Form Fields: Connect labels with input fields, checkboxes, and buttons
  • Data Fields: Map data labels with their corresponding values
  • Interactive Elements: Link text instructions with clickable elements
  • Validation Rules: Connect validation text with form fields
  • Cross-References: Map text mentions with figures, tables, and sections
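
To capture these connection types in the response itself, one option is to extend the schema with an explicit relation field. This is a hypothetical sketch; the RelationType values below are illustrative and not defined by the API:

from enum import Enum

# Reuses GroundingWithText and the pydantic imports from the example above
class RelationType(str, Enum):
    LABEL_FIELD = "label_field"
    DATA_VALUE = "data_value"
    INSTRUCTION_ELEMENT = "instruction_element"
    VALIDATION_FIELD = "validation_field"
    CROSS_REFERENCE = "cross_reference"

class GroundingRelation(BaseModel):
    source: GroundingWithText = Field(..., description="The text mention, e.g. a form label")
    target: GroundingWithText = Field(..., description="The visual element it refers to")
    relation: RelationType = Field(..., description="How the two elements are connected")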

What format do the bounding boxes use?
The bounding boxes come in the xywh format, where x and y are the top-left corner coordinates, and w and h are the width and height of the bounding box. All values are normalized to the range [0, 1] relative to the document image dimensions, as in the example output above.
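
As a quick reference, converting a normalized xywh box into pixel-space corner coordinates looks like this (the 1920x1080 frame size is just an example):

def xywh_to_xyxy(xywh: tuple[float, float, float, float],
                 img_w: int, img_h: int) -> tuple[int, int, int, int]:
    """Convert a normalized (x, y, w, h) box to pixel (left, top, right, bottom)."""
    x, y, w, h = xywh
    left, top = round(x * img_w), round(y * img_h)
    return left, top, left + round(w * img_w), top + round(h * img_h)

# The first speaker box from the example output, on a 1920x1080 frame
print(xywh_to_xyxy((0.428, 0.217, 0.128, 0.286), 1920, 1080))
>>> (822, 234, 1068, 543)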

What spatial relationships can be analyzed?
  • Label-Field Pairs: Identify which labels belong to which fields
  • Hierarchical Structure: Understand parent-child relationships
  • Proximity Analysis: Determine related elements based on spatial proximity
  • Alignment Patterns: Detect aligned elements and groups
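
Some of this analysis can also be approximated client-side; for example, label-field pairing reduces to nearest-neighbor matching on box centers. A rough sketch, assuming you have already split the returned elements into labels and fields:

def box_center(xywh: tuple[float, float, float, float]) -> tuple[float, float]:
    """Center point of a normalized (x, y, w, h) box."""
    x, y, w, h = xywh
    return (x + w / 2, y + h / 2)

def pair_by_proximity(labels: list[GroundingWithText],
                      fields: list[GroundingWithText]) -> list[tuple[GroundingWithText, GroundingWithText]]:
    """Pair each label with the closest field by squared center distance."""
    pairs = []
    for label in labels:
        lx, ly = box_center(label.xywh)
        nearest = min(fields, key=lambda f: (box_center(f.xywh)[0] - lx) ** 2
                                          + (box_center(f.xywh)[1] - ly) ** 2)
        pairs.append((label, nearest))
    return pairs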

What does the confidence score mean?
The confidence score is a value between 0 and 1 that indicates the confidence of the text-visual mapping. Higher scores indicate more reliable connections.
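
Note that the example schema above does not request a confidence score; to receive one, add a field for it to the schema and filter on it client-side. A sketch with an assumed score field:

class ScoredGrounding(GroundingWithText):
    score: float = Field(..., ge=0.0, le=1.0, description="Confidence of the text-visual mapping")

def filter_confident(elements: list[ScoredGrounding], threshold: float = 0.8) -> list[ScoredGrounding]:
    """Keep only mappings at or above the confidence threshold."""
    return [e for e in elements if e.score >= threshold]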

Can visual grounding process multi-page documents?
Yes, visual grounding can process multi-page documents. Each page is analyzed separately, and the results include page-specific mappings and relationships.
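
One client-side pattern for this is to render each page to an image and issue one request per page, keyed by page index. A sketch assuming pdf2image for rendering and a hypothetical document.pdf:

import base64
from io import BytesIO
from pdf2image import convert_from_path

results = {}
for page_num, page in enumerate(convert_from_path("document.pdf"), start=1):
    # Encode the rendered page as a base64 data URL for the image_url content part
    buf = BytesIO()
    page.save(buf, format="JPEG")
    data_url = "data:image/jpeg;base64," + base64.b64encode(buf.getvalue()).decode()

    page_response = client.chat.completions.create(
        model="vlm-agent-1",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Localize all form field labels on this page."},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }],
        response_format={"type": "json_schema", "schema": GroundingResponse.model_json_schema()},
    )
    results[page_num] = GroundingResponse.model_validate_json(page_response.choices[0].message.content)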