Overview
VLM Run Evaluations let you systematically measure how well your skills, agents, and domains perform against real-world feedback. By comparing model outputs to human-corrected ground truth, evaluations surface per-field accuracy, highlight weaknesses, and track improvements across runs.

Evaluations are available in the VLM Run Dashboard. You can also trigger evaluations programmatically via the Agent API.
How Evaluations Work
Evaluations follow a three-step loop: collect feedback, run an evaluation, and act on results.

Collect Feedback
Before running an evaluation, you need ground-truth data. This comes from the Feedback API — when you submit corrected JSON for a prediction or agent execution, that correction becomes the expected output for evaluation.

Feedback can include:
- JSON corrections — a corrected version of the model’s structured output
- Notes — free-text comments describing what was wrong (the platform can infer corrections from notes using an LLM)
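A correction submission can be pictured as a small payload that pairs a prediction with its ground truth. This is an illustrative sketch only: the key names (`request_id`, `response`, `notes`) are assumptions for the example, not the documented Feedback API schema.

```python
def build_feedback(request_id, corrected_json=None, notes=None):
    """Bundle a JSON correction and/or free-text notes for one prediction.

    At least one of the two feedback kinds must be provided.
    """
    if corrected_json is None and notes is None:
        raise ValueError("Provide a corrected JSON payload, notes, or both")
    payload = {"request_id": request_id}
    if corrected_json is not None:
        # The human-corrected structured output becomes the expected output.
        payload["response"] = corrected_json
    if notes is not None:
        # Free text the platform can infer a correction from via an LLM.
        payload["notes"] = notes
    return payload

feedback = build_feedback(
    "req_123",
    corrected_json={"invoice_number": "INV-001", "total": 412.50},
    notes="Total was misread; the correct amount is 412.50",
)
```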
Run an Evaluation
From the Evaluations page in the dashboard, click New Evaluation and configure:
- Source — select the skill, agent, or domain you want to evaluate
- Date range — choose the time window for data to include
- Evaluators — select one or more scoring strategies (see Evaluator Types below)
- Infer corrections — optionally use an LLM to generate corrected JSON from feedback notes
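The four options above can be pictured as a single configuration object. The key names below are illustrative assumptions for the example, not a documented API schema.

```python
# Hypothetical evaluation configuration mirroring the dashboard options:
# source, date range, evaluators, and the infer-corrections toggle.
evaluation_config = {
    "source": {"type": "domain", "id": "document.invoice"},
    "date_range": {"start": "2024-06-01", "end": "2024-06-30"},
    "evaluators": ["field_accuracy", "fuzzy_match"],
    "infer_corrections": True,  # let an LLM derive corrected JSON from notes
}
```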
Review Results
Once complete, the evaluation produces:
- Overall accuracy — percentage of fields that match the expected output
- Field-by-field breakdown — per-field accuracy with original vs. corrected values
- Optimization insights — weakest fields, error patterns, and suggested improvements
- Sample-level detail — drill into individual predictions to see exactly what differed
Evaluation Sources
You can evaluate three types of sources:

| Source Type | What It Evaluates | Data Comes From |
|---|---|---|
| Skill | A specific skill (by ID) | Requests and agent executions that used the skill |
| Agent | A specific agent (by ID and version) | Agent executions |
| Domain | A pre-built domain (e.g. document.invoice) | Prediction requests for that domain |
Evaluator Types
Each evaluator measures accuracy differently. You can select one or more evaluators per run:

Field Accuracy
Compares each field in the model’s output against the corrected output, field by field. This is the default evaluator and produces:
- Per-field accuracy — fraction of samples where that field matches
- Overall accuracy — weighted average across all fields
- Field breakdown — matched, mismatched, and accepted (no correction provided) counts
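A minimal sketch of the field-accuracy idea, assuming flat dictionaries and taking overall accuracy as the plain mean of per-field scores; the platform's actual weighting and its handling of nested or accepted (uncorrected) fields may differ.

```python
def field_accuracy(samples):
    """samples: list of (predicted: dict, expected: dict) pairs."""
    fields = {f for _, expected in samples for f in expected}
    per_field = {}
    for f in fields:
        # Compare predicted vs. corrected value for every sample that has this field.
        pairs = [(pred.get(f), exp[f]) for pred, exp in samples if f in exp]
        matches = sum(1 for p, e in pairs if p == e)
        per_field[f] = matches / len(pairs)
    overall = sum(per_field.values()) / len(per_field)
    return per_field, overall

samples = [
    ({"vendor": "Acme", "total": 100.0}, {"vendor": "Acme", "total": 100.0}),
    ({"vendor": "Acme", "total": 90.0},  {"vendor": "Acme", "total": 100.0}),
]
per_field, overall = field_accuracy(samples)
# vendor -> 1.0, total -> 0.5, overall -> 0.75
```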
Fuzzy Match
Uses fuzzy string matching to compare field values, which is more tolerant of minor differences like whitespace, punctuation, or formatting variations. Produces an average fuzzy match rate across all fields.

Use this when exact string equality is too strict for your use case (e.g., names, addresses, free-text fields).
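As a stand-in for the platform's matcher, the idea can be sketched with Python's stdlib `difflib.SequenceMatcher`; the normalization steps and the 0.9 threshold below are illustrative assumptions, not the platform's actual settings.

```python
import difflib

def fuzzy_match(predicted, expected, threshold=0.9):
    """Return True when two strings are similar enough after light normalization."""
    ratio = difflib.SequenceMatcher(
        None, predicted.strip().lower(), expected.strip().lower()
    ).ratio()
    return ratio >= threshold

fuzzy_match("Acme Corp.", "acme corp")  # True: case/punctuation differences tolerated
fuzzy_match("Acme Corp.", "Apex Ltd.")  # False: genuinely different values
```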
LLM Judge
Uses an LLM to semantically judge whether the model’s output is correct given the input and expected output. The LLM assigns a score (typically 0–1) based on semantic equivalence rather than exact matching.

This evaluator is useful when:
- Outputs are free-form text or descriptions
- Multiple phrasings are equally correct
- You want to evaluate meaning rather than formatting
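The judge pattern boils down to building a grading prompt and parsing a score from the model's reply. The prompt wording and single-number reply format below are illustrative assumptions; the platform's actual judge prompt is not documented here.

```python
def judge_prompt(input_text, predicted, expected):
    """Build a grading prompt asking for semantic equivalence, not exact match."""
    return (
        "You are grading a model's answer for semantic equivalence.\n"
        f"Input: {input_text}\n"
        f"Expected: {expected}\n"
        f"Predicted: {predicted}\n"
        "Reply with a single score between 0 and 1."
    )

def parse_score(reply):
    """Parse the judge's reply and clamp it to the 0-1 range."""
    score = float(reply.strip())
    return min(max(score, 0.0), 1.0)
```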
Exact Match
A strict equality check — the model’s output must be character-for-character identical to the expected output. Produces an exact match rate across all samples.
Exact match is automatically limited to expected outputs of 64 characters or fewer. For longer responses, the LLM Judge evaluator is used instead.
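The 64-character routing rule described above can be expressed directly; the evaluator names in this sketch are illustrative labels, not API identifiers.

```python
EXACT_MATCH_MAX_LEN = 64  # expected outputs longer than this fall back to LLM Judge

def choose_evaluator(expected):
    """Route short expected outputs to exact match, long ones to the LLM judge."""
    return "exact_match" if len(expected) <= EXACT_MATCH_MAX_LEN else "llm_judge"

choose_evaluator("INV-001")   # "exact_match"
choose_evaluator("x" * 200)   # "llm_judge"
```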
How to Think About Evaluations
The Feedback-Evaluation Loop
Evaluations are most powerful when used as part of a continuous improvement cycle:

- Deploy your skill or agent to production
- Collect feedback from users — corrected JSON outputs and notes
- Run evaluations periodically to measure accuracy
- Improve your skill’s prompt, schema, or instructions based on evaluation insights
- Repeat — each cycle should improve accuracy
What Metrics to Track
- Overall accuracy is your primary health indicator. Track it over time using the accuracy trend chart on the dashboard.
- Field-level accuracy reveals which specific fields are underperforming. Focus optimization efforts on the weakest fields first.
- API completion rate shows what percentage of requests completed successfully (vs. failed). A drop here indicates infrastructure or schema issues, not prompt quality.
- Accuracy delta (shown as a change from the previous run) tells you whether your latest changes helped or hurt.
When to Run Evaluations
- After updating a skill — to verify the change improved accuracy
- After collecting new feedback — to get a fresh accuracy reading with more ground truth
- On a regular cadence — weekly or biweekly evaluations help you catch regressions early
- Before and after optimization — to measure the impact of the platform’s automatic skill optimization
Interpreting Results
| Accuracy Range | Interpretation | Suggested Action |
|---|---|---|
| 95–100% | Excellent — the skill is performing very well | Monitor for regressions |
| 85–95% | Good — minor issues on specific fields | Review weakest fields and refine prompts |
| 70–85% | Fair — significant room for improvement | Analyze error patterns, consider schema changes |
| Below 70% | Needs attention — major accuracy gaps | Review optimization insights, consider rewriting instructions |
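As a quick helper, the bands in the table translate to a simple lookup. The treatment of exact boundary values (e.g. exactly 85%) is an assumption, since adjacent ranges in the table share endpoints.

```python
def interpret_accuracy(pct):
    """Map an overall accuracy percentage to its interpretation band."""
    if pct >= 95:
        return "Excellent"
    if pct >= 85:
        return "Good"
    if pct >= 70:
        return "Fair"
    return "Needs attention"
```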
Metrics Dashboard
The evaluations page in the dashboard provides an at-a-glance view of your evaluation health:

- Summary cards — total evaluation runs, broken down by source type (skills, agents, domains)
- Accuracy trend chart — visualizes accuracy over the last N runs, filterable by source
- Field accuracy table — aggregated per-field accuracy across recent runs, sorted by weakest first
- Evaluation history — a sortable, paginated table of all evaluation runs with status, accuracy, and actions
Actions
From the evaluation detail page or the history table, you can take several actions on completed evaluation runs:

Optimize
Automatically optimize the skill’s instructions based on evaluation results. This creates a new skill version with improved prompts.
Rerun
Re-evaluate with a different skill version or different evaluator settings. Useful for A/B testing prompt changes.
Delete
Remove an evaluation run from your history. Only completed or failed runs can be deleted.
Optimize
The Optimize action uses evaluation results to automatically generate an improved version of your skill. It analyzes:

- Weakest fields and their error patterns
- Representative failures from the evaluation samples
- The current skill instructions
Rerun
The Rerun action re-evaluates the same data window but with a different skill version. This is useful for:

- Comparing the original skill against an optimized version
- Testing different evaluator configurations
- Verifying that a manual skill update improved accuracy
Best Practices
- Start with Field Accuracy — it’s the most actionable evaluator and gives you per-field granularity
- Collect JSON corrections when possible — they produce more reliable evaluations than note-inferred corrections
- Use the “Infer corrections” toggle when you only have text notes — the LLM will generate a best-effort corrected JSON
- Focus on the weakest fields — the optimization insights section highlights exactly where to focus
- Compare runs over time — the accuracy trend chart is your best tool for tracking skill improvements
- Use version pinning — when rerunning evaluations, pin specific skill versions so results are reproducible
Feedback Guide
Learn how to collect the ground-truth data that powers evaluations
Skills Introduction
Understand how skills work and how to create them