A visual guide to the MarkdownPage schema used for document extraction and processing.
The `MarkdownDocument` schema is the cornerstone of VLM Run’s document processing system, providing a standardized, machine-readable representation of complex documents. This technical reference guide details the schema’s architecture, components, and implementation patterns.
Data Model
The `MarkdownDocument` schema addresses the fundamental challenges in document processing.
MarkdownPage
A `MarkdownDocument` holds a list of `MarkdownPage` objects in its `pages` field, each representing one page of the document.
The `MarkdownPage` schema and its nested components are summarized in the table below.
Tabular Representation of `MarkdownPage`
| Component | Field | Type | Description |
|---|---|---|---|
| MarkdownDocument | pages | List[MarkdownPage] | Pages in the document |
| MarkdownPage | metadata | PageMetadata | Metadata of the page |
| MarkdownPage | tables | List[Table] | Tables in the page |
| MarkdownPage | figures | List[Figure] | Figures in the page |
| MarkdownPage | content | str | Content of the page |
| PageMetadata | language | str | Language of the document |
| PageMetadata | page_number | int | Page number of the document (0-indexed) |
| Table | metadata.title | str | Title of the table |
| Table | metadata.caption | str | Caption of the table |
| Table | metadata.notes | str | Notes about the table |
| Table | headers.id | str | Unique identifier for the header |
| Table | headers.column | int | Column index of the header |
| Table | headers.name | str | Name of the header |
| Table | headers.dtype | str | Data type of the header |
| Table | data.* | dict[str, Any] | Maps column header ids to values |
| Table | bbox | BoxCoords | Bounding box of the table |
| Figure | id | int | Unique identifier for the figure |
| Figure | title | str | Title of the figure |
| Figure | caption | str | Caption of the figure |
| Figure | bbox | BoxCoords | Bounding box of the figure |
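To make the nesting in the table concrete, the fields can be mirrored as plain Python dataclasses. This is only an illustrative sketch, not VLM Run's actual model definitions: the layout of `BoxCoords`, the `TableMetadata` and `TableHeader` class names, and the choice to store `data` as one dict per row are assumptions, since the table does not spell them out.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class BoxCoords:
    # Assumption: the source does not define BoxCoords; an
    # (x, y, width, height) tuple is used here purely for illustration.
    xywh: Tuple[float, float, float, float]


@dataclass
class PageMetadata:
    language: str
    page_number: int  # 0-indexed, as stated in the table


@dataclass
class TableMetadata:  # hypothetical name for the `metadata.*` fields
    title: str = ""
    caption: str = ""
    notes: str = ""


@dataclass
class TableHeader:  # hypothetical name for the `headers.*` fields
    id: str
    column: int
    name: str
    dtype: str


@dataclass
class Table:
    metadata: TableMetadata
    headers: List[TableHeader]
    # Assumption: one dict per row, keyed by header id.
    data: List[Dict[str, Any]]
    bbox: Optional[BoxCoords] = None


@dataclass
class Figure:
    id: int
    title: str
    caption: str
    bbox: Optional[BoxCoords] = None


@dataclass
class MarkdownPage:
    metadata: PageMetadata
    content: str  # listed before tables/figures so those can default to empty
    tables: List[Table] = field(default_factory=list)
    figures: List[Figure] = field(default_factory=list)


@dataclass
class MarkdownDocument:
    pages: List[MarkdownPage] = field(default_factory=list)


# Build a one-page document with no tables or figures.
page = MarkdownPage(
    metadata=PageMetadata(language="en", page_number=0),
    content="# Quarterly Report",
)
doc = MarkdownDocument(pages=[page])
```

Dataclasses keep the sketch dependency-free; a validation library could be substituted if the fields need runtime type checking.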
MarkdownTable
Each table is represented by a `<Table id="tb-{id}"/>` tag in the markdown content, with the actual table content stored in the `tables` list. This allows a rich representation of the table’s data while maintaining the document’s flow.
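One way to see how the placeholder and the `tables` list fit together is to expand each tag back into a pipe-delimited table. The sketch below works on plain dictionaries shaped like the schema table; `inline_tables` and `table_to_markdown` are hypothetical helper names, and treating the number in `tb-{id}` as a position in the `tables` list is an assumption, since the schema lists no `id` field on `Table`.

```python
import re
from typing import Any, Dict, List

TABLE_TAG = re.compile(r'<Table id="tb-(\d+)"\s*/>')


def table_to_markdown(table: Dict[str, Any]) -> str:
    """Render one table dict as a pipe-delimited markdown table."""
    headers = sorted(table["headers"], key=lambda h: h["column"])
    lines = [
        "| " + " | ".join(h["name"] for h in headers) + " |",
        "|" + "|".join("---" for _ in headers) + "|",
    ]
    for row in table["data"]:
        # Each row maps header ids to cell values.
        cells = [str(row.get(h["id"], "")) for h in headers]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


def inline_tables(content: str, tables: List[Dict[str, Any]]) -> str:
    """Replace each <Table id="tb-N"/> tag with the rendered N-th table."""

    def substitute(match: "re.Match[str]") -> str:
        index = int(match.group(1))
        # Assumption: N indexes into `tables`; unknown ids are left untouched.
        if 0 <= index < len(tables):
            return table_to_markdown(tables[index])
        return match.group(0)

    return TABLE_TAG.sub(substitute, content)


# Hand-written sample data for illustration.
sample_tables = [
    {
        "headers": [
            {"id": "h0", "column": 0, "name": "Item", "dtype": "str"},
            {"id": "h1", "column": 1, "name": "Qty", "dtype": "int"},
        ],
        "data": [{"h0": "Pen", "h1": 3}],
    }
]
sample_content = 'Inventory:\n\n<Table id="tb-0"/>\n\nEnd.'
print(inline_tables(sample_content, sample_tables))
```

Leaving unresolved tags in place, rather than raising, keeps the page readable when a table is missing from the list.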
Each chart is represented by a `<Chart id="ch-{id}"/>` tag in the content. The chart details are stored in the `figures` list, including properties such as the figure’s `id`, `title`, `caption`, and `bbox`.
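Chart placeholders can be resolved against the `figures` list in a similar way. The following is a minimal sketch over plain dictionaries; `referenced_figures` is a hypothetical helper, and it assumes the number in `ch-{id}` matches the figure's integer `id` field.

```python
import re
from typing import Any, Dict, List, Optional

CHART_TAG = re.compile(r'<Chart id="ch-(\d+)"\s*/>')


def referenced_figures(
    content: str, figures: List[Dict[str, Any]]
) -> List[Optional[Dict[str, Any]]]:
    """Return the figure for each chart tag in reading order (None if absent)."""
    # Assumption: the number in "ch-{id}" equals the figure's `id`.
    by_id = {figure["id"]: figure for figure in figures}
    return [by_id.get(int(m.group(1))) for m in CHART_TAG.finditer(content)]


# Hand-written sample data for illustration.
sample_figures = [{"id": 2, "title": "Revenue", "caption": "By quarter", "bbox": None}]
sample_content = 'Overview <Chart id="ch-2"/> and appendix <Chart id="ch-9"/>.'

for figure in referenced_figures(sample_content, sample_figures):
    print(figure["title"] if figure else "unresolved chart tag")
```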
Here is how the `MarkdownPage` model is used to process a document:
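As a rough illustration of page-by-page processing, the sketch below walks a `MarkdownDocument`-shaped dictionary and collects a small summary for each page. It does not call the VLM Run API: the payload is hand-written, `summarize_pages` is a hypothetical helper, and the JSON layout is assumed to follow the field names in the schema table.

```python
import json
from typing import Any, Dict, List


def summarize_pages(document: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Collect a small per-page summary from a MarkdownDocument-shaped dict."""
    summaries = []
    for page in document.get("pages", []):
        meta = page.get("metadata", {})
        summaries.append(
            {
                "page_number": meta.get("page_number"),
                "language": meta.get("language"),
                "num_tables": len(page.get("tables", [])),
                "num_figures": len(page.get("figures", [])),
                "num_chars": len(page.get("content", "")),
            }
        )
    return summaries


# Hypothetical payload, hand-written for this example.
payload = json.loads(
    """
    {
      "pages": [
        {
          "metadata": {"language": "en", "page_number": 0},
          "content": "# Title",
          "tables": [],
          "figures": [{"id": 0, "title": "Trend", "caption": "", "bbox": null}]
        }
      ]
    }
    """
)

for summary in summarize_pages(payload):
    print(summary)
```

Using `.get` with defaults means a page that omits `tables` or `figures` is summarized with a count of zero instead of raising a `KeyError`.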