Connect text elements with their visual locations in documents for precise content understanding. Perfect for interactive document analysis, content verification, automated form filling, and document comparison workflows.
Figures: visual grounding examples on a document form, a driver's license, and a TV news broadcast, each showing text-to-visual element mapping with highlighted connections.

Usage Example

For visual grounding, we highly recommend using the Structured Outputs API so that text-visual mappings and spatial relationships come back in a structured, validated format.
The following example maps text elements to their visual locations in a TV news broadcast frame. The response schema pairs each text element with its bounding box, and can be extended with confidence scores and relationship types (see the FAQ below).
import openai
from pydantic import BaseModel, Field

class GroundingWithText(BaseModel):
    content: str = Field(..., description="The text content")
    xywh: tuple[float, float, float, float] = Field(..., description="Bounding box as normalized (x, y, width, height)")

class GroundingResponse(BaseModel):
    elements: list[GroundingWithText] = Field(..., description="Text to visual mappings")

# Initialize the client
client = openai.OpenAI(
    base_url="https://agent.vlm.run/v1/openai",
    api_key="<VLMRUN_API_KEY>"
)

# Perform visual grounding
response = client.chat.completions.create(
    model="vlm-agent-1",
    messages=[
        {
          "role": "user",
          "content": [
            {"type": "text", "text": "Localize all the speaker names in the TV news broadcast text and visualize them on the image. Only provide one bounding box for each speaker name."},
            {"type": "image_url", "image_url": {"url": "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/media.tv-news/finance_bb_3_speakers.jpg", "detail": "auto"}}
          ]
        }
    ],
    response_format={"type": "json_schema", "schema": GroundingResponse.model_json_schema()},
)

# Print the response
print(response.choices[0].message.content)
>>> {"elements": [{"content": "HAIDI STROUD-WATTS", "xywh": [0.428, 0.217, 0.128, 0.286]}, ...]}

# Validate the response
print(GroundingResponse.model_validate_json(response.choices[0].message.content))
>>> GroundingResponse(elements=[GroundingWithText(content="HAIDI STROUD-WATTS", xywh=(0.428, 0.217, 0.128, 0.286)), ...])
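
To sanity-check the mappings, you can draw the returned boxes back onto the source image. A minimal sketch using Pillow and requests, assuming the xywh values are normalized to the image dimensions as in the output above:

import requests
from io import BytesIO
from PIL import Image, ImageDraw

# Fetch the same broadcast frame used in the request above
url = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/media.tv-news/finance_bb_3_speakers.jpg"
image = Image.open(BytesIO(requests.get(url).content))
draw = ImageDraw.Draw(image)

grounding = GroundingResponse.model_validate_json(response.choices[0].message.content)
for element in grounding.elements:
    x, y, w, h = element.xywh
    # Scale normalized coordinates up to pixel space before drawing
    left, top = x * image.width, y * image.height
    right, bottom = left + w * image.width, top + h * image.height
    draw.rectangle([left, top, right, bottom], outline="red", width=3)
    draw.text((left, max(top - 12, 0)), element.content, fill="red")

image.save("grounded.jpg")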

FAQ

What types of text-visual connections can be detected?
  • Form Fields: Connect labels with input fields, checkboxes, and buttons
  • Data Fields: Map data labels with their corresponding values
  • Interactive Elements: Link text instructions with clickable elements
  • Validation Rules: Connect validation text with form fields
  • Cross-References: Map text mentions with figures, tables, and sections
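
To capture these connection types in the response itself, one option is to extend the schema with an explicit relation field. This is a hypothetical sketch; the RelationType values below are illustrative and not defined by the API:

from enum import Enum

# Reuses GroundingWithText and the pydantic imports from the example above
class RelationType(str, Enum):
    LABEL_FIELD = "label_field"
    DATA_VALUE = "data_value"
    INSTRUCTION_ELEMENT = "instruction_element"
    VALIDATION_FIELD = "validation_field"
    CROSS_REFERENCE = "cross_reference"

class GroundingRelation(BaseModel):
    source: GroundingWithText = Field(..., description="The text mention, e.g. a form label")
    target: GroundingWithText = Field(..., description="The visual element it refers to")
    relation: RelationType = Field(..., description="How the two elements are connected")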

What format do the bounding boxes use?
The bounding boxes come in the xywh format, where x and y are the top-left corner coordinates, and w and h are the width and height of the bounding box. All values are normalized to the range [0, 1] relative to the document image dimensions, as in the example output above.
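
As a quick reference, converting a normalized xywh box into pixel-space corner coordinates looks like this (the 1920x1080 frame size is just an example):

def xywh_to_xyxy(xywh: tuple[float, float, float, float],
                 img_w: int, img_h: int) -> tuple[int, int, int, int]:
    """Convert a normalized (x, y, w, h) box to pixel (left, top, right, bottom)."""
    x, y, w, h = xywh
    left, top = round(x * img_w), round(y * img_h)
    return left, top, left + round(w * img_w), top + round(h * img_h)

# The first speaker box from the example output, on a 1920x1080 frame
print(xywh_to_xyxy((0.428, 0.217, 0.128, 0.286), 1920, 1080))
>>> (822, 234, 1068, 543)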

What spatial relationships can be analyzed?
  • Label-Field Pairs: Identify which labels belong to which fields
  • Hierarchical Structure: Understand parent-child relationships
  • Proximity Analysis: Determine related elements based on spatial proximity
  • Alignment Patterns: Detect aligned elements and groups
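
Some of this analysis can also be approximated client-side; for example, label-field pairing reduces to nearest-neighbor matching on box centers. A rough sketch, assuming you have already split the returned elements into labels and fields:

def box_center(xywh: tuple[float, float, float, float]) -> tuple[float, float]:
    """Center point of a normalized (x, y, w, h) box."""
    x, y, w, h = xywh
    return (x + w / 2, y + h / 2)

def pair_by_proximity(labels: list[GroundingWithText],
                      fields: list[GroundingWithText]) -> list[tuple[GroundingWithText, GroundingWithText]]:
    """Pair each label with the closest field by squared center distance."""
    pairs = []
    for label in labels:
        lx, ly = box_center(label.xywh)
        nearest = min(fields, key=lambda f: (box_center(f.xywh)[0] - lx) ** 2
                                          + (box_center(f.xywh)[1] - ly) ** 2)
        pairs.append((label, nearest))
    return pairs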

What does the confidence score mean?
The confidence score is a value between 0 and 1 that indicates the confidence of the text-visual mapping. Higher scores indicate more reliable connections.
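
Note that the example schema above does not request a confidence score; to receive one, add a field for it to the schema and filter on it client-side. A sketch with an assumed score field:

class ScoredGrounding(GroundingWithText):
    score: float = Field(..., ge=0.0, le=1.0, description="Confidence of the text-visual mapping")

def filter_confident(elements: list[ScoredGrounding], threshold: float = 0.8) -> list[ScoredGrounding]:
    """Keep only mappings at or above the confidence threshold."""
    return [e for e in elements if e.score >= threshold]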

Can visual grounding process multi-page documents?
Yes, visual grounding can process multi-page documents. Each page is analyzed separately, and the results include page-specific mappings and relationships.
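
One client-side pattern for this is to render each page to an image and issue one request per page, keyed by page index. A sketch assuming pdf2image for rendering and a hypothetical document.pdf:

import base64
from io import BytesIO
from pdf2image import convert_from_path

results = {}
for page_num, page in enumerate(convert_from_path("document.pdf"), start=1):
    # Encode the rendered page as a base64 data URL for the image_url content part
    buf = BytesIO()
    page.save(buf, format="JPEG")
    data_url = "data:image/jpeg;base64," + base64.b64encode(buf.getvalue()).decode()

    page_response = client.chat.completions.create(
        model="vlm-agent-1",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Localize all form field labels on this page."},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }],
        response_format={"type": "json_schema", "schema": GroundingResponse.model_json_schema()},
    )
    results[page_num] = GroundingResponse.model_validate_json(page_response.choices[0].message.content)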