
Overview

VLM Run Evaluations let you systematically measure how well your skills, agents, and domains perform against real-world feedback. By comparing model outputs to human-corrected ground truth, evaluations surface per-field accuracy, highlight weaknesses, and track improvements across runs.
Evaluations are available in the VLM Run Dashboard. You can also trigger evaluations programmatically via the Agent API.

How Evaluations Work

Evaluations follow a three-step loop: collect feedback, run an evaluation, and act on results.
1. Collect Feedback

Before running an evaluation, you need ground-truth data. This comes from the Feedback API — when you submit corrected JSON for a prediction or agent execution, that correction becomes the expected output for evaluation.
Feedback can include:
  • JSON corrections — a corrected version of the model’s structured output
  • Notes — free-text comments describing what was wrong (the platform can infer corrections from notes using an LLM)
The more feedback you collect, the more reliable your evaluation metrics become.
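As a sketch, a corrected-JSON feedback record might look like the following. The field names here (request_id, response, notes) are illustrative assumptions, not the exact Feedback API schema:

```python
# Hypothetical feedback record for a prediction whose invoice total was misread.
# Field names are illustrative; consult the Feedback API reference for the real schema.
feedback = {
    "request_id": "pred_123",          # the prediction being corrected
    "response": {                      # corrected JSON: becomes ground truth for evaluation
        "invoice_number": "INV-0042",
        "total": 1299.00,              # model originally returned 129.90
    },
    "notes": "Total misread: decimal point shifted.",
}
print(feedback["response"]["total"])
```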
2. Run an Evaluation

From the Evaluations page in the dashboard, click New Evaluation and configure:
  1. Source — select the skill, agent, or domain you want to evaluate
  2. Date range — choose the time window for data to include
  3. Evaluators — select one or more scoring strategies (see Evaluator Types below)
  4. Infer corrections — optionally use an LLM to generate corrected JSON from feedback notes
The evaluation runs asynchronously. You can monitor its progress in real time — the dashboard polls for status updates automatically.
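The asynchronous run-and-poll flow can be sketched with a stubbed status function standing in for the Agent API; the status names and polling shape here are assumptions, mirroring what the dashboard does automatically:

```python
import time

def get_evaluation_status(run_id, _state={"calls": 0}):
    """Stub standing in for an Agent API status call: 'running' twice, then 'completed'."""
    _state["calls"] += 1
    return "completed" if _state["calls"] >= 3 else "running"

def wait_for_evaluation(run_id, interval=0.01, timeout=5.0):
    """Poll until the evaluation reaches a terminal state."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_evaluation_status(run_id)
        if status in ("completed", "failed"):
            return status
        time.sleep(interval)
    raise TimeoutError(f"evaluation {run_id} did not finish in {timeout}s")

status = wait_for_evaluation("eval_001")
print(status)  # completed
```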
3. Review Results

Once complete, the evaluation produces:
  • Overall accuracy — percentage of fields that match the expected output
  • Field-by-field breakdown — per-field accuracy with original vs. corrected values
  • Optimization insights — weakest fields, error patterns, and suggested improvements
  • Sample-level detail — drill into individual predictions to see exactly what differed
Use these results to identify where your skill or prompt needs refinement.

Evaluation Sources

You can evaluate three types of sources:
| Source Type | What It Evaluates | Data Comes From |
|---|---|---|
| Skill | A specific skill (by ID) | Requests and agent executions that used the skill |
| Agent | A specific agent (by ID and version) | Agent executions |
| Domain | A pre-built domain (e.g. document.invoice) | Prediction requests for that domain |
When creating an evaluation, the dashboard shows a preview of available data — total items, number with feedback, and the most recent item date — so you can confirm there is enough ground truth before running.

Evaluator Types

Each evaluator measures accuracy differently. You can select one or more evaluators per run:
Field Accuracy

Compares each field in the model’s output against the corrected output, field by field. This is the default evaluator and produces:
  • Per-field accuracy — fraction of samples where that field matches
  • Overall accuracy — weighted average across all fields
  • Field breakdown — matched, mismatched, and accepted (no correction provided) counts
This evaluator works best when you have JSON corrections as ground truth.
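A minimal sketch of field-level scoring, assuming each sample pairs the model output with its corrected JSON. This is illustrative, not the platform's exact implementation, and the overall figure here is an unweighted average of the per-field scores:

```python
def field_accuracy(samples):
    """samples: list of (predicted, expected) JSON dicts.
    Returns per-field match fractions plus their unweighted average."""
    counts = {}  # field -> (matched, total)
    for predicted, expected in samples:
        for field, truth in expected.items():
            matched, total = counts.get(field, (0, 0))
            counts[field] = (matched + (predicted.get(field) == truth), total + 1)
    per_field = {f: m / t for f, (m, t) in counts.items()}
    overall = sum(per_field.values()) / len(per_field) if per_field else 0.0
    return per_field, overall

samples = [
    ({"vendor": "Acme", "total": 129.90}, {"vendor": "Acme", "total": 1299.00}),
    ({"vendor": "Acme", "total": 42.00},  {"vendor": "Acme", "total": 42.00}),
]
per_field, overall = field_accuracy(samples)
print(per_field)  # {'vendor': 1.0, 'total': 0.5}
print(overall)    # 0.75
```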
Fuzzy Match

Uses fuzzy string matching to compare field values, which is more tolerant of minor differences like whitespace, punctuation, or formatting variations. Produces an average fuzzy match rate across all fields.
Use this when exact string equality is too strict for your use case (e.g., names, addresses, free-text fields).
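Python's standard-library difflib gives a similar fuzzy ratio; the platform's matcher may differ, but the idea is the same:

```python
from difflib import SequenceMatcher

def fuzzy_match(a, b):
    """Similarity in [0, 1]; normalization makes it tolerant of case,
    whitespace, and trailing punctuation drift."""
    return SequenceMatcher(None, str(a).strip().lower(), str(b).strip().lower()).ratio()

print(fuzzy_match("123 Main St.", "123 main st"))  # close to 1.0 despite case/punctuation
print(fuzzy_match("Acme Corp", "Globex"))          # low similarity
```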
LLM Judge

Uses an LLM to semantically judge whether the model’s output is correct given the input and expected output. The LLM assigns a score (typically 0–1) based on semantic equivalence rather than exact matching.
This evaluator is useful when:
  • Outputs are free-form text or descriptions
  • Multiple phrasings are equally correct
  • You want to evaluate meaning rather than formatting
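The judge pattern boils down to assembling a grading prompt and parsing a numeric score from the reply. The prompt wording and reply format below are purely illustrative; VLM Run's actual judge prompt is not public:

```python
def build_judge_prompt(input_text, expected, actual):
    """Assemble a grading prompt (illustrative wording only)."""
    return (
        "You are grading a model's structured output.\n"
        f"Input: {input_text}\n"
        f"Expected: {expected}\n"
        f"Actual: {actual}\n"
        "Reply with a single score between 0 and 1 for semantic equivalence."
    )

def parse_score(reply):
    """Pull a 0-1 float out of the judge's reply, clamping out-of-range values."""
    score = float(reply.strip().split()[-1])
    return max(0.0, min(1.0, score))

prompt = build_judge_prompt("invoice.pdf", {"total": 1299.0}, {"total": "1,299.00"})
print(parse_score("Score: 0.95"))  # 0.95
```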
Exact Match

A strict equality check — the model’s output must be character-for-character identical to the expected output. Produces an exact match rate across all samples.
Exact match is automatically limited to expected outputs of 64 characters or fewer. For longer responses, the LLM Judge evaluator is used instead.
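The exact-match rule plus its 64-character fallback can be sketched as a simple router; the llm_judge argument below is a stand-in for the real evaluator:

```python
EXACT_MATCH_MAX_LEN = 64  # per the docs: longer expected outputs go to the LLM Judge

def exact_match(expected, actual):
    """Character-for-character equality, scored 1.0 or 0.0."""
    return 1.0 if expected == actual else 0.0

def score(expected, actual, llm_judge=lambda e, a: 0.5):
    """Route short expected outputs to exact match, longer ones to the judge."""
    if len(expected) <= EXACT_MATCH_MAX_LEN:
        return exact_match(expected, actual)
    return llm_judge(expected, actual)

print(score("INV-0042", "INV-0042"))  # 1.0 — short, exact
print(score("INV-0042", "INV-42"))    # 0.0 — short, mismatch
print(score("x" * 100, "x" * 100))    # routed to the judge stand-in
```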

How to Think About Evaluations

The Feedback-Evaluation Loop

Evaluations are most powerful when used as part of a continuous improvement cycle:
┌─────────────┐     ┌──────────────┐     ┌───────────────┐     ┌──────────────┐
│  Deploy a   │────▶│  Collect     │────▶│  Run an       │────▶│  Improve     │
│  Skill      │     │  Feedback    │     │  Evaluation   │     │  the Skill   │
└─────────────┘     └──────────────┘     └───────────────┘     └──────┬───────┘
       ▲                                                              │
       └──────────────────────────────────────────────────────────────┘
  1. Deploy your skill or agent to production
  2. Collect feedback from users — corrected JSON outputs and notes
  3. Run evaluations periodically to measure accuracy
  4. Improve your skill’s prompt, schema, or instructions based on evaluation insights
  5. Repeat — each cycle should improve accuracy

What Metrics to Track

  • Overall accuracy is your primary health indicator. Track it over time using the accuracy trend chart on the dashboard.
  • Field-level accuracy reveals which specific fields are underperforming. Focus optimization efforts on the weakest fields first.
  • API completion rate shows what percentage of requests completed successfully (vs. failed). A drop here indicates infrastructure or schema issues, not prompt quality.
  • Accuracy delta (shown as a change from the previous run) tells you whether your latest changes helped or hurt.
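The accuracy delta is just the difference between the two most recent runs; a minimal sketch:

```python
def accuracy_delta(runs):
    """runs: chronological list of overall accuracies (0-1).
    Returns the change from the previous run, as shown on the dashboard."""
    if len(runs) < 2:
        return None
    return runs[-1] - runs[-2]

history = [0.81, 0.84, 0.90]
delta = accuracy_delta(history)
print(f"{delta:+.0%}")  # +6%
```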

When to Run Evaluations

  • After updating a skill — to verify the change improved accuracy
  • After collecting new feedback — to get a fresh accuracy reading with more ground truth
  • On a regular cadence — weekly or biweekly evaluations help you catch regressions early
  • Before and after optimization — to measure the impact of the platform’s automatic skill optimization

Interpreting Results

| Accuracy Range | Interpretation | Suggested Action |
|---|---|---|
| 95–100% | Excellent — the skill is performing very well | Monitor for regressions |
| 85–95% | Good — minor issues on specific fields | Review weakest fields and refine prompts |
| 70–85% | Fair — significant room for improvement | Analyze error patterns, consider schema changes |
| Below 70% | Needs attention — major accuracy gaps | Review optimization insights, consider rewriting instructions |
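These bands translate directly into a lookup; the boundary handling (each threshold inclusive at its lower edge) is an assumption, since the table leaves the exact cutoffs ambiguous:

```python
def interpret(accuracy):
    """Map an overall accuracy (0-1) to the guidance band from the table above.
    Lower bounds are treated as inclusive (an assumption)."""
    if accuracy >= 0.95:
        return "Excellent"
    if accuracy >= 0.85:
        return "Good"
    if accuracy >= 0.70:
        return "Fair"
    return "Needs attention"

print(interpret(0.97), interpret(0.88), interpret(0.74), interpret(0.55))
```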

Metrics Dashboard

The evaluations page in the dashboard provides an at-a-glance view of your evaluation health:
  • Summary cards — total evaluation runs, broken down by source type (skills, agents, domains)
  • Accuracy trend chart — visualizes accuracy over the last N runs, filterable by source
  • Field accuracy table — aggregated per-field accuracy across recent runs, sorted by weakest first
  • Evaluation history — a sortable, paginated table of all evaluation runs with status, accuracy, and actions
You can filter the metrics dashboard by source (e.g., a specific skill or agent) and adjust the number of runs included in the trend calculation.
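The field accuracy table's "weakest first" aggregation can be sketched as a simple average across recent runs; the input shape (one dict of per-field accuracies per run) is an assumption for illustration:

```python
def weakest_fields(runs, limit=3):
    """Aggregate per-field accuracy across runs and sort weakest first,
    like the dashboard's field accuracy table.
    runs: list of {field: accuracy} dicts from recent evaluation runs."""
    totals = {}  # field -> (accuracy sum, run count)
    for run in runs:
        for field, acc in run.items():
            s, n = totals.get(field, (0.0, 0))
            totals[field] = (s + acc, n + 1)
    averaged = {f: s / n for f, (s, n) in totals.items()}
    return sorted(averaged.items(), key=lambda kv: kv[1])[:limit]

runs = [
    {"vendor": 1.0, "total": 0.6, "date": 0.9},
    {"vendor": 0.9, "total": 0.5, "date": 0.95},
]
print(weakest_fields(runs))  # weakest field first
```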

Actions

From the evaluation detail page or the history table, you can take several actions on completed evaluation runs:

Optimize

Automatically optimize the skill’s instructions based on evaluation results. This creates a new skill version with improved prompts.

Rerun

Re-evaluate with a different skill version or different evaluator settings. Useful for A/B testing prompt changes.

Delete

Remove an evaluation run from your history. Only completed or failed runs can be deleted.

Optimize

The Optimize action uses evaluation results to automatically generate an improved version of your skill. It analyzes:
  • Weakest fields and their error patterns
  • Representative failures from the evaluation samples
  • The current skill instructions
It then creates a new skill version with refined instructions that target the identified weaknesses. After optimization, you can rerun the evaluation with the new skill to measure the improvement.
Optimization works best when you have at least 10–20 feedback samples with JSON corrections. The more ground truth data available, the better the optimization.

Rerun

The Rerun action re-evaluates the same data window but with a different skill version. This is useful for:
  • Comparing the original skill against an optimized version
  • Testing different evaluator configurations
  • Verifying that a manual skill update improved accuracy
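A rerun comparison reduces to per-field deltas between two runs over the same data window; a sketch, assuming each run is summarized as a dict of per-field accuracies:

```python
def compare_runs(baseline, candidate):
    """Per-field accuracy deltas between two evaluation runs over the same
    data window, e.g. the original skill vs an optimized version."""
    return {f: round(candidate.get(f, 0.0) - acc, 3) for f, acc in baseline.items()}

baseline  = {"vendor": 0.95, "total": 0.60}
candidate = {"vendor": 0.95, "total": 0.85}
print(compare_runs(baseline, candidate))  # {'vendor': 0.0, 'total': 0.25}
```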

Best Practices

  • Start with Field Accuracy — it’s the most actionable evaluator and gives you per-field granularity
  • Collect JSON corrections when possible — they produce more reliable evaluations than note-inferred corrections
  • Use the “Infer corrections” toggle when you only have text notes — the LLM will generate a best-effort corrected JSON
  • Focus on the weakest fields — the optimization insights section highlights exactly where to focus
  • Compare runs over time — the accuracy trend chart is your best tool for tracking skill improvements
  • Use version pinning — when rerunning evaluations, pin specific skill versions so results are reproducible

Feedback Guide

Learn how to collect the ground-truth data that powers evaluations

Skills Introduction

Understand how skills work and how to create them