Visual Grounding
Learn how to extract visual groundings / citations from documents.
In some documents, users may want to ground the extracted content visually by localizing the corresponding elements on the page. For example, in a technical document, we may want to know exactly which table on the page a specific JSON table object corresponds to. This process is known as visual grounding.
VLM-1 provides the ability to extract visual groundings from documents with a simple interface. The extracted visual groundings can be used to understand the context of the visual elements in the document and to link them to the corresponding textual content. Let’s take a look at an example showcasing visually grounding tables in a hardware spec-sheet.
Example showcasing visually grounding tables.
The corresponding JSON output shows the bounding box coordinates of the visual grounding for each table (T1 and T2) in the document. This information can be used to link the visual elements to the corresponding textual content in the document.
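As a rough illustration of how such bounding boxes might be consumed downstream, the sketch below maps each table ID to a pixel-space box. The response shape here is an assumption made for illustration, not the documented VLM-1 schema: the `tables`, `id`, `page`, and `bbox` keys, and the normalized `[x_min, y_min, x_max, y_max]` convention, are all hypothetical, as are the helper names `to_pixel_box` and `index_groundings`.

```python
import json

# Hypothetical response shape: the real VLM-1 output schema may differ.
# Boxes are ASSUMED to be normalized [x_min, y_min, x_max, y_max] in [0, 1].
SAMPLE_RESPONSE = json.loads("""
{
  "tables": [
    {"id": "T1", "page": 0, "bbox": [0.10, 0.20, 0.90, 0.45]},
    {"id": "T2", "page": 0, "bbox": [0.10, 0.55, 0.90, 0.80]}
  ]
}
""")


def to_pixel_box(bbox, page_width, page_height):
    """Scale an assumed normalized box to integer pixel coordinates."""
    x_min, y_min, x_max, y_max = bbox
    return (
        round(x_min * page_width),
        round(y_min * page_height),
        round(x_max * page_width),
        round(y_max * page_height),
    )


def index_groundings(response, page_width, page_height):
    """Map each table id to its page number and pixel-space bounding box."""
    return {
        table["id"]: {
            "page": table["page"],
            "box": to_pixel_box(table["bbox"], page_width, page_height),
        }
        for table in response["tables"]
    }


# Page size in pixels is supplied by the caller (e.g. from the rendered page image).
groundings = index_groundings(SAMPLE_RESPONSE, page_width=1000, page_height=2000)
for table_id, info in groundings.items():
    print(table_id, info["page"], info["box"])
```

With pixel-space boxes keyed by table ID, a viewer could crop or highlight the region for a given JSON table object. If the actual output uses absolute pixel coordinates or a different key layout, only `to_pixel_box` and the key names would need to change.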
Get Started with our Document -> JSON API
Head over to our Document -> JSON API to start building your own document processing pipeline with VLM-1. Sign up for access to our API here.