Layout Detection

Layout Detection Example on the Qwen-2.5 VL Tech Report.

Usage Example

For layout detection, we strongly recommend using the Structured Outputs API so that layout elements and bounding boxes are returned in a structured, validated format.

The following example detects headers, paragraphs, tables, lists, figures, and other document elements. The response schema in this example records each element's type and bounding box; other fields, such as reading order, can be added to the schema in the same way.

from pydantic import BaseModel, Field
from vlmrun.client import VLMRun

class LayoutElement(BaseModel):
    type: str = Field(..., description="Type of layout element (caption, footnote, formula, list-item, page-footer, page-header, picture, section-header, table, text, title)")
    xywh: tuple[float, float, float, float] = Field(..., description="Bounding box coordinates")

class LayoutResponse(BaseModel):
    elements: list[LayoutElement] = Field(..., description="List of detected layout elements")

# Initialize the VLM Run client
client = VLMRun(
    base_url="https://agent.vlm.run/v1", api_key="<VLMRUN_API_KEY>"
)

# Analyze document layout
response = client.agent.completions.create(
    model="vlmrun-orion-1:auto",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Analyze the document layout and identify all elements with bounding boxes"},
                {"type": "image_url", "image_url": {"url": "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/document.layout/qwen-25-vl-tech-report.jpg", "detail": "auto"}}
            ]
        }
    ],
    response_format={"type": "json_schema", "schema": LayoutResponse.model_json_schema()},
)

# Print the response
print(response.choices[0].message.content)

# Validate the response against the schema
print(LayoutResponse.model_validate_json(response.choices[0].message.content))
# Example output:
# LayoutResponse(elements=[
#     LayoutElement(type="caption", xywh=(0.1, 0.0, 0.8, 0.02)),
#     LayoutElement(type="text", xywh=(0.1, 0.02, 0.8, 0.04)),
#     LayoutElement(type="title", xywh=(0.1, 0.06, 0.8, 0.02)),
#     LayoutElement(type="section-header", xywh=(0.1, 0.08, 0.8, 0.02)),
#     LayoutElement(type="text", xywh=(0.1, 0.1, 0.8, 0.04)),
#     LayoutElement(type="table", xywh=(0.1, 0.14, 0.8, 0.06)),
#     LayoutElement(type="text", xywh=(0.1, 0.2, 0.8, 0.04)),
#     LayoutElement(type="picture", xywh=(0.1, 0.24, 0.8, 0.06)),
#     LayoutElement(type="text", xywh=(0.1, 0.3, 0.8, 0.04)),
#     LayoutElement(type="formula", xywh=(0.1, 0.34, 0.8, 0.06)),
#     LayoutElement(type="text", xywh=(0.1, 0.4, 0.8, 0.04)),
#     LayoutElement(type="list-item", xywh=(0.1, 0.44, 0.8, 0.06)),
#     LayoutElement(type="text", xywh=(0.1, 0.5, 0.8, 0.04)),
#     LayoutElement(type="footnote", xywh=(0.1, 0.54, 0.8, 0.06)),
#     LayoutElement(type="text", xywh=(0.1, 0.6, 0.8, 0.04)),
#     LayoutElement(type="page-footer", xywh=(0.1, 0.64, 0.8, 0.06)),
#     LayoutElement(type="text", xywh=(0.1, 0.7, 0.8, 0.04)),
#     LayoutElement(type="page-header", xywh=(0.1, 0.74, 0.8, 0.06)),
#     LayoutElement(type="text", xywh=(0.1, 0.8, 0.8, 0.04)),
# ])
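
Because the schema declares `type` as a plain string, the response may in principle contain labels outside the list given in the field description. The following is a minimal sketch, using only the standard library, of how the returned JSON could be summarized and checked. The helper name `summarize_layout`, the `sample` payload, and the `sidebar` label are hypothetical and exist only for illustration; in practice the input would be `response.choices[0].message.content`.

```python
import json
from collections import Counter

# Labels listed in the `type` field description of the schema above.
KNOWN_TYPES = {
    "caption", "footnote", "formula", "list-item", "page-footer",
    "page-header", "picture", "section-header", "table", "text", "title",
}

def summarize_layout(content: str) -> tuple[Counter, list[dict]]:
    """Count elements per known type and collect elements with other labels."""
    elements = json.loads(content)["elements"]
    counts = Counter(e["type"] for e in elements if e["type"] in KNOWN_TYPES)
    unknown = [e for e in elements if e["type"] not in KNOWN_TYPES]
    return counts, unknown

# Hand-written stand-in for the JSON string returned by the API.
sample = json.dumps({"elements": [
    {"type": "title", "xywh": [0.1, 0.06, 0.8, 0.02]},
    {"type": "text", "xywh": [0.1, 0.1, 0.8, 0.04]},
    {"type": "text", "xywh": [0.1, 0.2, 0.8, 0.04]},
    {"type": "sidebar", "xywh": [0.9, 0.1, 0.1, 0.5]},  # not a known label
]})

counts, unknown = summarize_layout(sample)
print(dict(counts))
print(unknown)
```

Elements with unrecognized labels are returned separately rather than dropped, so the caller can decide whether to ignore, log, or remap them.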

FAQ

What layout elements are supported?

Headers: H1-H6 level headers with hierarchical structure
Paragraphs: Body text blocks with proper text flow
Titles: Main title of the document
Tables: Structured data with row/column detection
Figures: Images, charts, diagrams, and visual elements
Lists: Bulleted and numbered list structures
Captions: Figure and table captions with associations
Footnotes: Footnotes with references and content
Formulas: Mathematical formulas and equations
Pictures: Images and visual elements
Section Headers: Section headers and titles

What format do the bounding boxes come in?

Bounding boxes are returned in xywh format: x and y are the coordinates of the top-left corner, and w and h are the width and height of the box. All values are in pixels relative to the document image.
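
Some image tools expect corner coordinates (x1, y1, x2, y2) rather than xywh. The sketch below shows that conversion. It also includes a scaling helper, which only matters under the assumption that the values you receive are fractions of the page size (the sample output in the usage example shows values between 0 and 1). Both function names are hypothetical helpers, not part of the VLM Run SDK.

```python
def xywh_to_xyxy(box):
    """Convert (x, y, width, height) to (x1, y1, x2, y2) corner coordinates."""
    x, y, w, h = box
    return (x, y, x + w, y + h)

def scale_xywh(box, image_width, image_height):
    """Scale a fractional xywh box to pixel units for a given image size.

    Assumption: the input values are fractions of the image dimensions.
    Skip this step if the values are already in pixels.
    """
    x, y, w, h = box
    return (x * image_width, y * image_height, w * image_width, h * image_height)

pixel_box = (10, 20, 30, 40)
print(xywh_to_xyxy(pixel_box))

fractional_box = (0.5, 0.25, 0.25, 0.5)
print(scale_xywh(fractional_box, image_width=800, image_height=600))
```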

What is the reading order?

The reading order indicates the sequence in which elements should be read, following the natural document flow from top to bottom and left to right. This is useful for accessibility and content extraction.
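
When a response carries only bounding boxes, a rough top-to-bottom, left-to-right ordering can be approximated on the client side. The following is a naive sketch of that idea, not the service's own reading-order logic; the function name `naive_reading_order` is hypothetical. It is only reasonable for single-column pages, since multi-column layouts would interleave columns.

```python
def naive_reading_order(boxes):
    """Sort xywh boxes by their top edge (y), breaking ties by left edge (x)."""
    return sorted(boxes, key=lambda box: (box[1], box[0]))

boxes = [
    (300, 10, 50, 20),  # top row, right
    (10, 200, 50, 20),  # lower on the page
    (10, 10, 50, 20),   # top row, left
]
ordered = naive_reading_order(boxes)
print(ordered)
```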

Can it process multi-page documents?

Yes, layout detection can process multi-page documents. Each page is analyzed separately, and the results include page-specific bounding boxes and reading order.
