Analyze and understand user interface elements in screenshots and application images. Perfect for automated testing, design system validation, accessibility auditing, and mobile app analysis.
UI parsing example showing element detection and classification with interactive elements

UI VQA & Grounding
Web Interface
Web interface UI parsing

Examples of UI parsing for different interface types.

Usage Example

For UI parsing, we highly recommend using the Structured Outputs API to get the UI elements and hierarchy in a structured and validated data format.

import openai
from pydantic import BaseModel, Field

class UIElement(BaseModel):
  type: str = Field(..., description="Type of UI element (button, input, text, etc.)")
  text: str | None = Field(None, description="Text content of the element")
  interactive: bool = Field(..., description="Whether the element is interactive")
  xywh: tuple[float, float, float, float] = Field(..., description="Bounding box as (x, y, width, height)")

class UIResponse(BaseModel):
  elements: list[UIElement] = Field(..., description="List of detected UI elements")

# Initialize the client
client = openai.OpenAI(
    base_url="https://agent.vlm.run/v1/openai",
    api_key="<VLMRUN_API_KEY>"
)

# Parse UI elements in the image
response = client.chat.completions.create(
    model="vlm-agent-1",
    messages=[
        {
          "role": "user",
          "content": [
            {"type": "text", "text": "Analyze all UI elements in this screenshot"},
            {"type": "image_url", "image_url": {"url": "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/web.ui-automation/win11.jpeg", "detail": "auto"}}
          ]
        }
    ],
    response_format={"type": "json_schema", "schema": UIResponse.model_json_schema()},
)

# Print the response
print(response.choices[0].message.content)

# Validate the response
print(UIResponse.model_validate_json(response.choices[0].message.content))
# Example output: UIResponse(elements=[UIElement(type="button", text="Sign In", interactive=True, xywh=(0.25, 0.5, 0.04, 0.02)), ...])
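For automated testing, the parsed elements usually need to be turned into click coordinates. The sketch below is one way to do that, under two assumptions that the example output suggests but this page does not state: that `xywh` holds values normalized to the range 0 to 1, and that `x` and `y` refer to the top-left corner of the box. The helper names `to_pixel_box` and `click_targets` are hypothetical and not part of any SDK, and the sample data is made up for illustration.

```python
from typing import NamedTuple


class PixelBox(NamedTuple):
    left: int
    top: int
    width: int
    height: int


def to_pixel_box(xywh, image_width, image_height):
    """Scale a normalized (x, y, w, h) box to integer pixel coordinates.

    Assumption: x and y are the top-left corner, and all four values are
    fractions of the image size in the range [0, 1].
    """
    x, y, w, h = xywh
    return PixelBox(
        left=round(x * image_width),
        top=round(y * image_height),
        width=round(w * image_width),
        height=round(h * image_height),
    )


def click_targets(elements, image_width, image_height):
    """Return (text, center_x, center_y) for each interactive element."""
    targets = []
    for element in elements:
        if not element["interactive"]:
            continue
        box = to_pixel_box(element["xywh"], image_width, image_height)
        center_x = box.left + box.width // 2
        center_y = box.top + box.height // 2
        targets.append((element["text"], center_x, center_y))
    return targets


# Hypothetical parsed output: plain dicts with the same fields as UIElement.
elements = [
    {"type": "button", "text": "Sign In", "interactive": True, "xywh": (0.25, 0.5, 0.04, 0.02)},
    {"type": "text", "text": "Welcome", "interactive": False, "xywh": (0.1, 0.1, 0.3, 0.05)},
]

print(click_targets(elements, image_width=1920, image_height=1080))
```

If the API returns pixel coordinates or center-based boxes instead, the scaling step should be dropped or adjusted accordingly.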

FAQ

UI parsing analyzes screenshots and application images to detect and classify every UI element, such as buttons, inputs, and other interactive components, typically for automated testing.
UI VQA & Grounding answers a specific question about a screenshot or application image and locates the UI elements relevant to that question. Unlike UI parsing, which returns every element in the image, it returns only the elements that match the question. In most cases, UI VQA & Grounding gives more accurate results, so prefer it when you are looking for a particular element.
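To illustrate the difference, a targeted question can be paired with a much narrower schema than `UIResponse`. The following is a minimal sketch, not the documented VQA & Grounding interface: the schema, the question, and the `build_grounding_request` helper are all hypothetical, and the request simply reuses the model name, image URL, and `response_format` shape from the usage example above. The schema is written by hand as a JSON Schema dict to keep the sketch dependency-free.

```python
import json

# Hypothetical JSON Schema for a single grounded answer (an assumption,
# not a schema published by the API).
GROUNDED_ANSWER_SCHEMA = {
    "type": "object",
    "properties": {
        "answer": {"type": "string", "description": "Answer to the question"},
        "xywh": {
            "type": "array",
            "items": {"type": "number"},
            "minItems": 4,
            "maxItems": 4,
            "description": "Bounding box of the element that supports the answer",
        },
    },
    "required": ["answer", "xywh"],
}


def build_grounding_request(question, image_url):
    """Build keyword arguments for client.chat.completions.create()."""
    return {
        "model": "vlm-agent-1",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_url, "detail": "auto"}},
                ],
            }
        ],
        "response_format": {"type": "json_schema", "schema": GROUNDED_ANSWER_SCHEMA},
    }


request = build_grounding_request(
    "Where is the Start button?",  # hypothetical question
    "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/web.ui-automation/win11.jpeg",
)
print(json.dumps(request["response_format"], indent=2))
```

With the client from the usage example, the request could then be sent as `client.chat.completions.create(**request)`.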