MarkdownPage
A visual guide to the MarkdownPage schema used for document extraction and processing.
The `MarkdownDocument` schema is the cornerstone of VLM Run’s document processing system, providing a standardized, machine-readable representation of complex documents. This technical reference guide details the schema’s architecture, components, and implementation patterns.
MarkdownDocument Data Model
The `MarkdownDocument` schema addresses the fundamental challenges in document processing:
- Structural Preservation: Maintains document hierarchy and relationships
- Content Extraction: Handles mixed content types (text, tables, figures, code)
- Spatial Understanding: Preserves layout and positioning information
- Data Integrity: Ensures accurate representation of structured elements
- Extensibility: Supports custom annotations and metadata
1. MarkdownPage
A `MarkdownDocument` is a list of `MarkdownPage` objects, each representing a page in the document.
Here’s an alternative way to visualize the `MarkdownPage` schema:
Tabular Representation of `MarkdownPage`
| Component | Field | Type | Description |
|---|---|---|---|
| MarkdownDocument | pages | List[MarkdownPage] | Pages in the document |
| MarkdownPage | metadata | PageMetadata | Metadata of the page |
| | tables | List[Table] | Tables in the page |
| | figures | List[Figure] | Figures in the page |
| | content | str | Content of the page |
| PageMetadata | language | str | Language of the document |
| | page_number | int | Page number of the document (0-indexed) |
| Table | metadata.title | str | Title of the table |
| | metadata.caption | str | Caption of the table |
| | metadata.notes | str | Notes about the table |
| | headers.id | str | Unique identifier for the header |
| | headers.column | int | Column index of the header |
| | headers.name | str | Name of the header |
| | headers.dtype | str | Data type of the header |
| | data.* | dict[str, Any] | Maps column header ids to values |
| | bbox | BoxCoords | Bounding box of the table |
| Figure | id | int | Unique identifier for the figure |
| | title | str | Title of the figure |
| | caption | str | Caption of the figure |
| | bbox | BoxCoords | Bounding box of the figure |
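The field listing above can be mirrored as plain Python dataclasses. This is a minimal sketch for orientation, not VLM Run’s actual class definitions: the helper class names `TableMetadata` and `TableHeader`, the 4-tuple form of `BoxCoords`, and the reading of `data` as a list of row dictionaries are all assumptions, since this guide does not specify them.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple

# Assumption: the guide does not define BoxCoords; a 4-tuple of numbers is used here.
BoxCoords = Tuple[float, float, float, float]


@dataclass
class PageMetadata:
    language: str      # language of the document
    page_number: int   # 0-indexed page number


@dataclass
class TableMetadata:   # hypothetical name for the `metadata.*` fields of a table
    title: str = ""
    caption: str = ""
    notes: str = ""


@dataclass
class TableHeader:     # hypothetical name for the `headers.*` fields of a table
    id: str            # unique identifier for the header
    column: int        # column index of the header
    name: str
    dtype: str


@dataclass
class Table:
    metadata: TableMetadata
    headers: List[TableHeader]
    # Assumption: one dict per row, keyed by header id.
    data: List[Dict[str, Any]]
    bbox: BoxCoords


@dataclass
class Figure:
    id: int
    title: str
    caption: str
    bbox: BoxCoords


@dataclass
class MarkdownPage:
    metadata: PageMetadata
    content: str
    tables: List[Table] = field(default_factory=list)
    figures: List[Figure] = field(default_factory=list)


@dataclass
class MarkdownDocument:
    pages: List[MarkdownPage] = field(default_factory=list)
```

A page can then be built with `MarkdownPage(metadata=PageMetadata(language="en", page_number=0), content="# Title")` and collected into a `MarkdownDocument`.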
2. MarkdownTable
Tables are represented with a `<Table id="tb-{id}"/>` tag in the markdown content, with the actual table content stored in the `tables` list. This allows for a rich representation of the table’s data while maintaining the document’s flow.
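As a rough illustration of working with these placeholders, the sketch below finds `<Table id="tb-{id}"/>` tags in a page’s content and substitutes a rendered table for each. The function names are hypothetical, and how an id maps to an entry in the `tables` list is not stated in this guide, so the caller supplies that mapping.

```python
import re
from typing import Dict, List

# Matches <Table id="tb-{id}"/> and captures the {id} part as a string.
TABLE_TAG = re.compile(r'<Table id="tb-([^"]+)"\s*/>')


def table_ids(content: str) -> List[str]:
    """Return the ids of all table placeholders, in document order."""
    return [match.group(1) for match in TABLE_TAG.finditer(content)]


def inline_tables(content: str, rendered: Dict[str, str]) -> str:
    """Replace each placeholder with its rendered table text.

    `rendered` maps a placeholder id to replacement text (an assumption:
    the caller decides how ids relate to the `tables` list).
    Placeholders with unknown ids are left untouched.
    """
    def substitute(match: re.Match) -> str:
        return rendered.get(match.group(1), match.group(0))

    return TABLE_TAG.sub(substitute, content)
```

Leaving unknown placeholders in place keeps the content lossless when a table is missing from the mapping.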
3. Charts and Figures
Charts and figures are represented with a `<Chart id="ch-{id}"/>` tag in the content. The chart details are stored in the `figures` list, including properties such as `id`, `title`, `caption`, and `bbox`.
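A chart placeholder can be resolved against the `figures` list in a similar way. The sketch below assumes that the `{id}` in the tag equals the integer `id` field of a figure and that figures are available as plain dictionaries; both are assumptions, and `referenced_figures` is a hypothetical helper name.

```python
import re
from typing import Any, Dict, List

# Matches <Chart id="ch-{id}"/>; the id is assumed to be an integer,
# in line with the `int` type of the figure `id` field.
CHART_TAG = re.compile(r'<Chart id="ch-(\d+)"\s*/>')


def referenced_figures(content: str, figures: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return the figures referenced by chart tags, in document order."""
    by_id = {figure["id"]: figure for figure in figures}
    found = []
    for match in CHART_TAG.finditer(content):
        figure = by_id.get(int(match.group(1)))
        if figure is not None:  # skip tags with no matching figure
            found.append(figure)
    return found
```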
Example Usage
Here’s an example of how the `MarkdownPage` model is used to process a document:
Example JSON Response
Here’s an example of how the MarkdownPage schema appears in a JSON response:
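The snippet below is a hand-written illustration of the shape described in the field table, not output captured from VLM Run. It assumes the document serializes as an object with a top-level `pages` key; a real response may wrap this in additional fields. The `tables` and `figures` lists are left empty because the format of `bbox` is not specified in this guide.

```python
import json

# Illustrative payload only; all values are placeholders.
document = {
    "pages": [
        {
            "metadata": {"language": "en", "page_number": 0},
            "content": "# Sample heading\n\nSample paragraph text.",
            "tables": [],
            "figures": [],
        }
    ]
}

# Round-trip through JSON to show the serialized form is plain JSON data.
serialized = json.dumps(document, indent=2)
restored = json.loads(serialized)

# Walk the pages and report basic information about each one.
for page in restored["pages"]:
    meta = page["metadata"]
    print(meta["page_number"], meta["language"], len(page["tables"]), len(page["figures"]))
```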