Layout Detection Example on the Qwen-2.5 VL Tech Report.

Usage Example

For layout detection, we highly recommend using the Structured Outputs API to get the layout elements and bounding boxes in a structured and validated data format.
The following example detects headers, paragraphs, tables, lists, figures, and other document elements. The response schema used here captures the type and bounding box of each detected element.
import openai
from pydantic import BaseModel, Field

class LayoutElement(BaseModel):
  type: str = Field(..., description="Type of layout element (caption, footnote, formula, list-item, page-footer, page-header, picture, section-header, table, text, title)")
  xywh: tuple[float, float, float, float] = Field(..., description="Bounding box coordinates")

class LayoutResponse(BaseModel):
  elements: list[LayoutElement] = Field(..., description="List of detected layout elements")

# Initialize the client
client = openai.OpenAI(
    base_url="https://agent.vlm.run/v1/openai",
    api_key="<VLMRUN_API_KEY>"
)

# Analyze document layout
response = client.chat.completions.create(
    model="vlm-agent-1",
    messages=[
        {
          "role": "user",
          "content": [
            {"type": "text", "text": "Analyze the document layout and identify all elements with bounding boxes"},
            {"type": "image_url", "image_url": {"url": "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/document.layout/qwen-25-vl-tech-report.jpg", "detail": "auto"}}
          ]
        }
    ],
    response_format={"type": "json_schema", "schema": LayoutResponse.model_json_schema()},
)

# Print the response
print(response.choices[0].message.content)

# Validate the response
print(LayoutResponse.model_validate_json(response.choices[0].message.content))
>>> LayoutResponse(elements=[LayoutElement(type="caption", xywh=(0.1, 0.0, 0.8, 0.02)), LayoutElement(type="text", xywh=(0.1, 0.02, 0.8, 0.04)), LayoutElement(type="title", xywh=(0.1, 0.06, 0.8, 0.02)), LayoutElement(type="section-header", xywh=(0.1, 0.08, 0.8, 0.02)), LayoutElement(type="text", xywh=(0.1, 0.1, 0.8, 0.04)), LayoutElement(type="table", xywh=(0.1, 0.14, 0.8, 0.06)), LayoutElement(type="text", xywh=(0.1, 0.2, 0.8, 0.04)), LayoutElement(type="picture", xywh=(0.1, 0.24, 0.8, 0.06)), LayoutElement(type="text", xywh=(0.1, 0.3, 0.8, 0.04)), LayoutElement(type="formula", xywh=(0.1, 0.34, 0.8, 0.06)), LayoutElement(type="text", xywh=(0.1, 0.4, 0.8, 0.04)), LayoutElement(type="list-item", xywh=(0.1, 0.44, 0.8, 0.06)), LayoutElement(type="text", xywh=(0.1, 0.5, 0.8, 0.04)), LayoutElement(type="footnote", xywh=(0.1, 0.54, 0.8, 0.06)), LayoutElement(type="text", xywh=(0.1, 0.6, 0.8, 0.04)), LayoutElement(type="page-footer", xywh=(0.1, 0.64, 0.8, 0.06)), LayoutElement(type="text", xywh=(0.1, 0.7, 0.8, 0.04)), LayoutElement(type="page-header", xywh=(0.1, 0.74, 0.8, 0.06)), LayoutElement(type="text", xywh=(0.1, 0.8, 0.8, 0.04))])
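
Once the response is validated, a common next step is to pick out elements of one type or to count what was found. The sketch below is only an illustration: it works on a hand-written JSON string shaped like the `LayoutResponse` schema above (not real API output), and the helper names `elements_of_type` and `count_by_type` are hypothetical.

```python
import json
from collections import Counter

# Hypothetical sample shaped like response.choices[0].message.content
content = json.dumps({
    "elements": [
        {"type": "title", "xywh": [0.1, 0.06, 0.8, 0.02]},
        {"type": "text", "xywh": [0.1, 0.1, 0.8, 0.04]},
        {"type": "table", "xywh": [0.1, 0.14, 0.8, 0.06]},
        {"type": "text", "xywh": [0.1, 0.2, 0.8, 0.04]},
    ]
})


def elements_of_type(content: str, element_type: str) -> list[dict]:
    """Return the elements whose "type" field equals element_type."""
    elements = json.loads(content)["elements"]
    return [e for e in elements if e["type"] == element_type]


def count_by_type(content: str) -> Counter:
    """Count how many elements of each type were detected."""
    return Counter(e["type"] for e in json.loads(content)["elements"])


tables = elements_of_type(content, "table")
counts = count_by_type(content)
print(tables)
print(counts)
```

With a real response, the same helpers could be called on `response.choices[0].message.content` directly, since that field holds the JSON string.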

FAQ

  • Headers: H1-H6 level headers with hierarchical structure
  • Paragraphs: Body text blocks with proper text flow
  • Titles: Main title of the document
  • Tables: Structured data with row/column detection
  • Figures: Images, charts, diagrams, and visual elements
  • Lists: Bulleted and numbered list structures
  • Captions: Figure and table captions with associations
  • Footnotes: Footnotes with references and content
  • Formulas: Mathematical formulas and equations
  • Pictures: Images and visual elements
  • Section Headers: Section headers and titles
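Bounding boxes use the xywh format: x and y are the coordinates of the top-left corner, and w and h are the width and height of the box. In the example output above, the values are fractions between 0 and 1 relative to the width and height of the document image, so multiply by the image dimensions to obtain pixel coordinates.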
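
To draw or crop a detected element, the xywh box usually needs to be turned into pixel corner coordinates. The sketch below assumes the values are fractions of the image size, as in the sample output above; if your response already reports pixels, skip the scaling. The function name `xywh_to_pixel_corners` and the image size used here are illustrative assumptions.

```python
def xywh_to_pixel_corners(xywh, image_width, image_height):
    """Convert a normalized (x, y, w, h) box to pixel (left, top, right, bottom)."""
    x, y, w, h = xywh
    left = round(x * image_width)
    top = round(y * image_height)
    right = round((x + w) * image_width)
    bottom = round((y + h) * image_height)

    # Clamp to the image bounds in case a box slightly overshoots the page
    left = min(max(left, 0), image_width)
    right = min(max(right, 0), image_width)
    top = min(max(top, 0), image_height)
    bottom = min(max(bottom, 0), image_height)
    return left, top, right, bottom


# Hypothetical 1000 x 2000 pixel page image
box = xywh_to_pixel_corners((0.1, 0.14, 0.8, 0.06), image_width=1000, image_height=2000)
print(box)
```

The (left, top, right, bottom) form is what most imaging libraries expect for cropping and drawing rectangles.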
The reading order indicates the sequence in which elements should be read, following the natural document flow from top to bottom and left to right. This is useful for accessibility and content extraction.
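
The example schema above does not define a reading-order field, so if you need an ordering on the client side, one rough fallback is to sort the boxes top to bottom and then left to right. This is a heuristic sketch, not the service's own reading order: it assumes a single-column page, and the `row_tolerance` parameter (how close two y values must be to count as the same row) is a hypothetical knob.

```python
def sort_reading_order(elements, row_tolerance=0.01):
    """Sort elements top to bottom, then left to right within a row.

    Boxes whose top edges fall into the same row_tolerance bucket are
    treated as one row. Multi-column layouts need a smarter approach.
    """
    def key(element):
        x, y, _w, _h = element["xywh"]
        return (round(y / row_tolerance), x)

    return sorted(elements, key=key)


# Hand-written sample boxes, deliberately out of order
elements = [
    {"type": "text", "xywh": (0.1, 0.30, 0.8, 0.04)},
    {"type": "picture", "xywh": (0.55, 0.10, 0.35, 0.15)},
    {"type": "table", "xywh": (0.1, 0.10, 0.4, 0.15)},
    {"type": "title", "xywh": (0.1, 0.02, 0.8, 0.03)},
]

ordered = sort_reading_order(elements)
print([e["type"] for e in ordered])
```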
Yes, the layout detection can process multi-page documents. Each page is analyzed separately, and the results include page-specific bounding boxes and reading orders.