Extracting data from tables in documents is difficult, especially when layouts are dense and involve nested headers, merged cells, and implicit structural hierarchies. Plain text extraction and basic table parsing often fail to preserve the semantics and structure that reliable downstream analysis and processing require. vlm-1, coupled with the specialized MarkdownTable schema, is designed to extract such tables with high fidelity while preserving their structure.

The Challenge: Beyond Simple Extraction

Traditional methods often flatten table structures, losing critical layout information:

  • Nested Headers: Hierarchical column relationships are lost.
  • Merged Cells: Spanning information is discarded, breaking row/column alignment.
  • Semantic Grouping: Visual cues indicating related data are ignored.

This loss of information hinders accurate data analysis, comparison across documents, and seamless integration into data pipelines (e.g., Pandas DataFrames).

MarkdownTable: A Schema for Structure and Semantics

To address these challenges, we introduce a new schema called MarkdownTable, designed to capture not only the cell content but also the table’s structure and metadata.

from typing import List
from pydantic import BaseModel, Field

# Simplified representation for documentation
# Actual implementation includes more detailed fields
class MarkdownTable(BaseModel):
    """A table with layout information."""

    class MarkdownTableMetadata(BaseModel):
        title: str | None = Field(None, description="Title of the table.")
        caption: str | None = Field(None, description="Caption of the table.")
        notes: str | None = Field(None, description="Notes accompanying the table.")

    class MarkdownTableHeader(BaseModel):
        id: str = Field(..., description="Unique, sanitized identifier (e.g., 'col_subcol').")
        column: int = Field(..., description="0-indexed column index.")
        name: str = Field(..., description="Hierarchical name (e.g., 'Header > Subheader').")
        dtype: str | None = Field(None, description="Inferred data type.")

    metadata: MarkdownTableMetadata = Field(default_factory=MarkdownTableMetadata)
    content: str | None = Field(
        None,
        description="Table data in Markdown format, optimized for structure preservation."
    )
    headers: List[MarkdownTableHeader] = Field(..., description="List of structured header objects.")
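
As an illustration of how the id and name fields relate, the following minimal sketch derives both from per-column header paths. The helpers sanitize and build_headers are hypothetical names introduced here and are not part of the VLM Run SDK; the sanitization and de-duplication rules are assumptions, and the actual model output may follow different conventions.

```python
import re
from typing import Dict, List, Tuple


def sanitize(part: str) -> str:
    """Lowercase a header label and collapse non-alphanumeric runs into '_'."""
    return re.sub(r"[^a-z0-9]+", "_", part.lower()).strip("_")


def build_headers(paths: List[Tuple[str, ...]]) -> List[Dict[str, object]]:
    """Turn per-column header paths into {id, column, name} records."""
    headers: List[Dict[str, object]] = []
    seen: Dict[str, int] = {}
    for column, path in enumerate(paths):
        base_id = "_".join(sanitize(p) for p in path)
        # Assumed rule: suffix repeated ids so every column stays unique.
        count = seen.get(base_id, 0)
        seen[base_id] = count + 1
        unique_id = base_id if count == 0 else f"{base_id}_{count + 1}"
        headers.append(
            {"id": unique_id, "column": column, "name": " > ".join(path)}
        )
    return headers


# Header paths read top-down, one tuple per leaf column.
paths = [
    ("Parameter",),
    ("Conditions",),
    ("Performance", "Min"),
    ("Performance", "Typ"),
    ("Performance", "Max"),
    ("Units",),
]
headers = build_headers(paths)
for header in headers:
    print(header)
```

Keeping the id separate from the display name means the id can stay stable across documents even if the visible header text varies slightly.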

Key Design Benefits:

  1. Hierarchical Headers (headers):
    • The name field uses > to represent nesting (e.g., Performance > Max Value).
    • The unique id provides a stable reference for each column, crucial for programmatic access and comparison.
    • column index and dtype add essential metadata for data validation and processing.
  2. Layout-Preserving Markdown (content):
    • The table is rendered as GitHub-flavored markdown.
    • Crucially, spanned cells are handled by repeating the cell value across the spanned rows/columns. This ensures the markdown table has a regular grid structure, directly loadable into structures like Pandas DataFrames without complex parsing or reconstruction.
    • The first row of the markdown contains only the unique ids from the headers list, providing a clean mapping for data ingestion.
  3. Rich Context (metadata): Captures titles, captions, and notes often surrounding tables, providing essential context that might be lost otherwise.
  4. Downstream Interoperability: The combination of structured headers and the regularized markdown content facilitates seamless conversion to Pandas DataFrames, database schemas, or input for further LLM analysis.
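
The span-repetition idea in point 2 can be sketched as follows. This is only an illustration under stated assumptions: the Cell type, fill_spans, and to_markdown are hypothetical helpers, and the input (a "Vcc = 5V" cell merged across two rows) is made up for the example. It does not describe how vlm-1 produces its output internally.

```python
from typing import List, NamedTuple


class Cell(NamedTuple):
    """A source cell anchored at (row, col) that may span several rows/columns."""
    row: int
    col: int
    text: str
    rowspan: int = 1
    colspan: int = 1


def fill_spans(cells: List[Cell], n_rows: int, n_cols: int) -> List[List[str]]:
    """Repeat each cell's text over every grid position it spans."""
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for cell in cells:
        for r in range(cell.row, min(cell.row + cell.rowspan, n_rows)):
            for c in range(cell.col, min(cell.col + cell.colspan, n_cols)):
                grid[r][c] = cell.text
    return grid


def to_markdown(ids: List[str], grid: List[List[str]]) -> str:
    """Render a regular grid as a pipe table whose first row holds column ids."""
    lines = ["| " + " | ".join(ids) + " |", "|" + "---|" * len(ids)]
    lines += ["| " + " | ".join(row) + " |" for row in grid]
    return "\n".join(lines)


# Hypothetical input: the "Vcc = 5V" condition is merged across two rows.
cells = [
    Cell(0, 0, "Sensitivity"),
    Cell(0, 1, "Vcc = 5V", rowspan=2),
    Cell(0, 2, "mV/G"),
    Cell(1, 0, "Zero Field Output"),
    Cell(1, 2, "mV"),
]
grid = fill_spans(cells, n_rows=2, n_cols=3)
markdown = to_markdown(["parameter", "conditions", "units"], grid)
print(markdown)
```

Because every row ends up with the same number of cells, the result can be parsed row by row without tracking span state.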

Extracting Structured Tables via SDK

You can use vlm-1 with the MarkdownTable schema through the Python SDK by passing the document.table-markdown domain to client.document.generate.

from pathlib import Path

from vlmrun.client import VLMRun
from vlmrun.client.types import PredictionResponse
from vlmrun.dtypes.base import MarkdownPage

# Initialize the client
client = VLMRun(api_key="<VLMRUN_API_KEY>")

# Extract tables from a document using the `document.table-markdown` domain,
# which is optimized for structured table extraction
path = Path("path/to/technical_document.pdf")
prediction: PredictionResponse = client.document.generate(
    file=path,
    domain="document.table-markdown"
)

# Access the structured response
page: MarkdownPage = prediction.response
print(page.model_dump_json(indent=2))

Example: Structured Output

Consider the following table with nested headers and merged cells.

[Image: Extracting Dense Tables in a Technical Document]

The MarkdownTable output captures this complexity:

{
  "metadata": {
    "title": "Sensor Performance Characteristics",
    "caption": "Table 1: Key performance metrics at 25°C",
    "notes": "Typical values unless otherwise noted."
  },
  "headers": [
    {
      "id": "parameter",
      "column": 0,
      "name": "Parameter",
      "dtype": "str"
    },
    {
      "id": "conditions",
      "column": 1,
      "name": "Conditions",
      "dtype": "str"
    },
    {
      "id": "performance_min",
      "column": 2,
      "name": "Performance > Min",
      "dtype": "float"
    },
    {
      "id": "performance_typ",
      "column": 3,
      "name": "Performance > Typ",
      "dtype": "float"
    },
    {
      "id": "performance_max",
      "column": 4,
      "name": "Performance > Max",
      "dtype": "float"
    },
    {
      "id": "units",
      "column": 5,
      "name": "Units",
      "dtype": "str"
    }
  ],
  "content": "| parameter | conditions | performance_min | performance_typ | performance_max | units |\n|---|---|---|---|---|---|\n| Sensitivity | Vcc = 5V | 0.9 | 1.0 | 1.1 | mV/G |\n| Zero Field Output | Vcc = 5V | 2450 | 2500 | 2550 | mV |\n| Linearity | Full Range | -0.5 | 0.1 | 0.5 | % |\n| Noise | 1Hz to 1kHz | - | 150 | - | uVrms |"
}
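
To show how the headers and content fields work together, here is a minimal standard-library sketch that parses the markdown content into typed records using each header's dtype. The helpers split_row and parse_table, the CASTERS mapping, and the choice to treat "-" or an empty cell as a missing value are assumptions made for illustration. This is not the SDK's to_dataframe implementation, and it does not handle escaped pipes inside cells.

```python
from typing import Any, Dict, List

# Assumed mapping from dtype strings to Python casters.
CASTERS = {"str": str, "float": float, "int": int}


def split_row(line: str) -> List[str]:
    """Split one pipe-table line into trimmed cell strings."""
    inner = line.strip().removeprefix("|").removesuffix("|")
    return [cell.strip() for cell in inner.split("|")]


def parse_table(content: str, headers: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Parse pipe-table markdown into records keyed by column id."""
    lines = [ln for ln in content.splitlines() if ln.strip()]
    ids = split_row(lines[0])
    dtypes = {h["id"]: h.get("dtype") for h in headers}
    records: List[Dict[str, Any]] = []
    for line in lines[2:]:  # skip the id row and the |---| delimiter row
        record: Dict[str, Any] = {}
        for col_id, raw in zip(ids, split_row(line)):
            if raw in ("", "-"):
                record[col_id] = None  # assumed missing-value convention
            else:
                cast = CASTERS.get(dtypes.get(col_id), str)
                record[col_id] = cast(raw)
        records.append(record)
    return records


# A three-column subset of the example output above.
headers = [
    {"id": "parameter", "column": 0, "name": "Parameter", "dtype": "str"},
    {"id": "performance_min", "column": 1, "name": "Performance > Min", "dtype": "float"},
    {"id": "units", "column": 2, "name": "Units", "dtype": "str"},
]
content = (
    "| parameter | performance_min | units |\n"
    "|---|---|---|\n"
    "| Linearity | -0.5 | % |\n"
    "| Noise | - | uVrms |"
)
rows = parse_table(content, headers)
for row in rows:
    print(row)
```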

Example: Rendered Output

Each MarkdownTable object contained in the MarkdownPage response also provides a render method that returns the table as a markdown string.

for table in page.tables:
    print(f"Table [title={table.metadata.title}, caption={table.metadata.caption}]")
    print(table.render())  # render the table as a markdown string

Benefits Demonstrated:

  • Nested Headers: Performance > Min, Performance > Typ, Performance > Max clearly show the hierarchy under Performance.
  • Markdown Ready: The render method returns a string that is valid GitHub-flavored markdown:
    | parameter | conditions | performance_min | performance_typ | performance_max | units |
    |---|---|---|---|---|---|
    | Sensitivity | Vcc = 5V | 0.9 | 1.0 | 1.1 | mV/G |
    | Zero Field Output | Vcc = 5V | 2450 | 2500 | 2550 | mV |
    | Linearity | Full Range | -0.5 | 0.1 | 0.5 | % |
    | Noise | 1Hz to 1kHz | - | 150 | - | uVrms |
    
  • Pandas Integration: The content field in the MarkdownTable object can be easily read into a Pandas DataFrame, with appropriate header metadata such as unique id, column index, name, and dtype. We provide a convenience method, to_dataframe, to convert the MarkdownTable object to a Pandas DataFrame.
    import pandas as pd
    
    # Convert the extracted markdown table to a pandas dataframe
    df: pd.DataFrame = table.to_dataframe(header="id")
    print(df)
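
Once columns are keyed by id, the headers list can be used to map ids back to their hierarchical names or to group columns by their top-level header. The snippet below is a small sketch using plain dictionaries; the record values are copied from the example table, and the variable names are illustrative only.

```python
from collections import defaultdict
from typing import Dict, List

headers = [
    {"id": "parameter", "column": 0, "name": "Parameter"},
    {"id": "performance_min", "column": 2, "name": "Performance > Min"},
    {"id": "performance_typ", "column": 3, "name": "Performance > Typ"},
    {"id": "performance_max", "column": 4, "name": "Performance > Max"},
]

# Map each unique id to its hierarchical display name.
id_to_name: Dict[str, str] = {h["id"]: h["name"] for h in headers}

# Group column ids under their top-level header (the part before the first " > ").
groups: Dict[str, List[str]] = defaultdict(list)
for h in headers:
    groups[h["name"].split(" > ")[0]].append(h["id"])

# Relabel one id-keyed record with display names.
record = {"parameter": "Sensitivity", "performance_min": 0.9, "performance_max": 1.1}
labelled = {id_to_name[key]: value for key, value in record.items()}
print(labelled)
print(dict(groups))

# With pandas, the same mapping could be applied via df.rename(columns=id_to_name).
```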
    

Fine-tuning for Domain Specificity

While vlm-1 offers strong general table extraction capabilities, performance on highly specialized or uniquely formatted tables (e.g., specific financial reports, legacy scientific documents) can be improved through fine-tuning. Consult our fine-tuning guides to adapt the model to your specific table structures and document types, improving accuracy and structural fidelity with the MarkdownTable schema.

Try our Document -> JSON API today

Head over to our Document -> JSON API to start building your own document processing pipeline with VLM Run. Sign up for access on our platform.