
VLM Run Evaluations compare stored model outputs from your organization’s traffic against human corrections recorded as feedback. They surface per-field accuracy, completion rates, and (for skill-based runs) structured hints you can feed into optimization and reruns.

What is being compared?

For each item in scope (prediction request or agent execution, depending on the source), the platform:
  1. Loads the original structured response the model produced (from durable storage).
  2. Loads the ground truth from feedback—either JSON you supplied in the response field or JSON inferred from text notes when Infer corrections is enabled.
  3. Runs one or more evaluators on those pairs and aggregates metrics.
Items in the selected window that never received feedback still count toward totals and API completion rate, but only corrected samples drive scorer metrics like field accuracy.
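
In rough terms, a run pairs each stored response with its correction and only scores the pairs that have one, while every in-scope item feeds the completion counters. The sketch below is illustrative Python with made-up record fields, not the platform's internal data model:

# Illustrative only: assumes each in-scope item carries the stored model
# response plus any feedback correction; the field names here are invented.
def build_evaluation_cases(items):
    cases, completed = [], 0
    for item in items:
        if item.get("api_status") == "completed":
            completed += 1                      # feeds API completion rate
        correction = item.get("feedback_json")  # explicit JSON, or JSON inferred from notes
        if correction is not None:
            cases.append((item["response"], correction))  # only these drive scorer metrics
    completion_rate = completed / len(items) if items else 0.0
    return cases, completion_rate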

How evaluations work

  1. Collect feedback — Use the Feedback & fine-tuning flow so corrections are tied to request or execution IDs.
  2. Run an evaluation — In the dashboard, pick a skill, agent, or request domain, set a date range, choose evaluators, and start the job. The run is asynchronous: you get a run record immediately and results fill in when scoring finishes.
  3. Review and act — Inspect summaries, per-field breakdowns, and samples. For skill evaluations only, you can Optimize (new skill version) or Rerun (auto-optimize, re-process items, re-score).

Run an evaluation in the dashboard

1. Open Evaluations

In the dashboard sidebar, choose Evaluations (same area as Overview, Requests, Executions, and Completions).

2. Start an evaluation and choose a source

Click New Evaluation to open the configuration panel, then pick what you are scoring:
  • Skill — One skill ID; includes matching prediction requests and agent executions that used that skill in the date range.
  • Agent — One agent (ID and version); uses agent executions in range.
  • Request domain — A hub domain string on requests (for example document.invoice); uses prediction requests for that domain.

3. Set the date range

Choose the time window. The UI shows a preview: total items, how many have feedback, how many include JSON corrections, and the latest activity timestamp—so you can confirm the slice before you spend credits.
Aim for at least 10–20 items with feedback before treating metrics as stable.
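
The preview boils down to a small slice summary; a hedged illustration of what it contains (the key names below are assumptions, only the counts mirror what the UI shows):

# Illustrative preview summary; key names are assumptions, values are placeholders.
preview = {
    "total_items": 142,                  # everything in the window for the chosen source
    "items_with_feedback": 37,           # items that received any feedback
    "items_with_json_corrections": 29,   # feedback that includes structured JSON
    "latest_activity": "2024-05-01T12:30:00Z",
}
sufficient = preview["items_with_feedback"] >= 10  # rough stability threshold from this guide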

4. Choose evaluators

Select one or more strategies. Defaults lean on Field accuracy; see Evaluator types for tradeoffs.

5. Select fields (optional)

Optionally restrict evaluation to specific fields by passing a list of top-level keys or dotted leaf-path prefixes (e.g. vendor_name, address, line_items). When omitted, all fields in the JSON output are evaluated. This is useful when you only care about a few critical fields, or want to exclude noisy fields from accuracy metrics. See Field selection for matching rules and API examples.

6. Infer corrections (optional)

Infer corrections is on by default in the dashboard. When it is on, the platform can turn notes-only feedback into structured JSON using an LLM, guided by your skill schema when available.
Inferred JSON is best-effort. For production-grade ground truth, prefer explicit JSON in feedback; see Feedback & fine-tuning.
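
For orientation, the two feedback shapes described above look roughly like this as Python dicts; key names other than response and notes are illustrative, and the authoritative payload format lives in Feedback & fine-tuning:

# Illustrative shapes only; see Feedback & fine-tuning for the real payloads.
explicit_feedback = {
    "request_id": "<request-id>",                                        # hypothetical key name
    "response": {"vendor_name": "Acme Corp", "total_amount": 1210.50},   # used as ground truth directly
}
notes_only_feedback = {
    "request_id": "<request-id>",                                        # hypothetical key name
    "notes": "Vendor should be 'Acme Corp', total is 1210.50",           # needs Infer corrections to become JSON
}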

7. Run

Submit the run. Progress updates appear while workers load responses, build evaluation cases, and execute evaluators.

Review results

Open a completed run from the history table.
  • Overall accuracy — Rolled up from field-level matches vs mismatches (see field accuracy below); samples without corrections are treated as accepted for reporting purposes.
  • Accuracy delta — Change vs the previous completed run for the same source label/type when one exists.
  • API completion rate — Share of in-scope items whose upstream API calls completed (vs failed or incomplete).
  • Total samples — Items evaluated in this run.

Field breakdown

The Field-by-Field Comparison block lists each extracted field so you can see where the model disagrees with ground truth.
Use the row filters at the top right—All, Errors, or Correct—to narrow the table; Errors is useful when you only want fields that still need work.
Column | Meaning
Field | Schema field name (for example Member_ID, Provider_Address).
Correct | Count of samples where this field matched the correction (shown in green in the UI).
Errors | Count of mismatches for that field (shown in red).
Accuracy | Share of samples where this field was correct.
Rows are ordered so the weakest fields surface first. Expand a field to open an inline drill-down: a nested table with Original (model output) and Corrected (feedback) side by side so you can compare concrete values.

Optimization hints (skill runs only)

When the source is a skill, completed results can include optimization hints: weakest fields, recurring themes from notes, a small set of representative failures, and short suggested next steps. These align with what the Optimize action uses when generating a new skill version.

Metrics and history

The main Evaluations page aggregates recent runs:
  • Summary cards — Mix of run counts, accuracy, and trend cues by source type.
  • Trend chart — Accuracy over recent completed runs (filterable by source).
  • Cross-run field view — Weakest fields across history.
  • History table — Status, accuracy, source, timestamps; row actions depend on status and source type.

Evaluation sources (API naming)

The product and API distinguish sources like this:
Dashboard concept | source_type in APIs | What is scored
Skill | skill | Requests and executions using that skill ID in the window
Agent | agent | Executions for that agent ID/version
Request domain | request_domain | Requests tagged with that domain string
The preview endpoint uses the same identifiers so you can validate counts before running.
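
As a hedged illustration of those identifiers (the source_type values come from the table above; the other key names are assumptions, not the exact contract):

# source_type values match the table above; surrounding key names are illustrative.
skill_source = {"source_type": "skill", "skill_id": "<skill-id>"}
agent_source = {"source_type": "agent", "agent_id": "<agent-id>", "agent_version": "<version>"}
domain_source = {"source_type": "request_domain", "domain": "document.invoice"}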

Evaluator types

You can combine evaluators in one run. In API payloads, they are defined as follows:
UI label | API ID | Role | Field selection
Field accuracy | field_accuracy | Default. Per-field leaf comparison with light type coercion (e.g. numeric "1" vs 1.0). | Yes
Fuzzy match | fuzzy_field_match | String similarity on text leaves; 0.7 similarity threshold by default. Non-strings fall back to equality. | Yes
LLM judge | llm_judge | Semantic score from a judge model over the full structured output. | No
Exact match | equals_expected | Strict per-field equality for short expected values only (see below). | Yes

Field accuracy (field_accuracy)

Flattens nested JSON into leaf paths, compares model output to ground truth, and reports match rate per field and overall. This is the primary signal for the dashboard’s field table and optimization hints.
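
A minimal sketch of the flatten-and-compare idea, assuming dotted leaf paths and simple numeric coercion (the production evaluator's exact coercion rules are not documented here):

# Illustrative flatten-and-compare, not the production evaluator.
def flatten(obj, prefix=""):
    """Turn nested dicts/lists into {"a.b[0]": value} leaf paths."""
    leaves = {}
    if isinstance(obj, dict):
        for k, v in obj.items():
            leaves.update(flatten(v, f"{prefix}.{k}" if prefix else k))
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            leaves.update(flatten(v, f"{prefix}[{i}]"))
    else:
        leaves[prefix] = obj
    return leaves

def leaf_matches(predicted, expected):
    if predicted == expected:
        return True
    try:  # light numeric coercion, e.g. "1" vs 1.0
        return float(predicted) == float(expected)
    except (TypeError, ValueError):
        return False

def field_accuracy(output, ground_truth):
    truth, pred = flatten(ground_truth), flatten(output)
    per_field = {path: leaf_matches(pred.get(path), value) for path, value in truth.items()}
    overall = sum(per_field.values()) / len(per_field) if per_field else 0.0
    return per_field, overall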

Fuzzy field match (fuzzy_field_match)

For string leaves, uses the better of character-level and token-level similarity so short labels and long sentences are both scored fairly. A pair passes if similarity ≥ 0.7; otherwise it fails that leaf’s fuzzy check.
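
An approximation of that check using Python's standard difflib for the character-level side and token overlap for the token-level side; the platform's exact similarity functions are not specified, but the 0.7 threshold is:

from difflib import SequenceMatcher

# Approximation of the fuzzy check; the platform's actual similarity metric may differ.
def fuzzy_leaf_match(predicted, expected, threshold=0.7):
    if not isinstance(predicted, str) or not isinstance(expected, str):
        return predicted == expected  # non-strings fall back to equality
    char_sim = SequenceMatcher(None, predicted.lower(), expected.lower()).ratio()
    pred_tokens, exp_tokens = set(predicted.lower().split()), set(expected.lower().split())
    union = pred_tokens | exp_tokens
    token_sim = len(pred_tokens & exp_tokens) / len(union) if union else 1.0
    return max(char_sim, token_sim) >= threshold  # pass if either view clears the threshold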

LLM judge (llm_judge)

Evaluates whether the structured output is semantically acceptable versus the correction, with a 0–1 score per sample. Use it when paraphrases or layout differences should not count as errors. Enabling it adds model latency and cost relative to deterministic evaluators.
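
Schematically, the judge builds a prompt from the output/correction pair and parses a 0–1 score from the reply; everything below, including the hypothetical call_judge_model helper and the prompt wording, is illustrative rather than the platform's actual judge:

import json

def judge_sample(output, correction, call_judge_model):
    """call_judge_model is a hypothetical callable: prompt str -> reply str."""
    prompt = (
        "Score from 0 to 1 how well the model output matches the corrected ground truth, "
        "ignoring paraphrasing and formatting differences. Reply with only the number.\n"
        f"Model output: {json.dumps(output)}\n"
        f"Ground truth: {json.dumps(correction)}"
    )
    try:
        return max(0.0, min(1.0, float(call_judge_model(prompt).strip())))
    except ValueError:
        return 0.0  # unparseable judge reply counts as a failure in this sketch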

Exact match (equals_expected)

Compares leaves with strict equality (including whitespace and casing). To avoid long free-text fields dominating the metric, only leaves whose expected string length is ≤ 128 characters participate; longer leaves are skipped for this evaluator’s aggregate. If nothing qualifies on a sample, that sample may be omitted from the exact-match rate.
LLM judge and Exact match are independent toggles—turning on the judge does not replace exact match; choose both only when you want both signals.
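
The length cutoff can be sketched as follows, reusing the illustrative flatten helper from the field-accuracy sketch above:

# Illustrative: strict equality, skipping long expected values (> 128 chars).
def exact_match(output, ground_truth, max_len=128):
    truth, pred = flatten(ground_truth), flatten(output)  # flatten() from the earlier sketch
    eligible = {p: v for p, v in truth.items() if not (isinstance(v, str) and len(v) > max_len)}
    if not eligible:
        return None  # sample omitted from the exact-match rate
    return sum(pred.get(p) == v for p, v in eligible.items()) / len(eligible)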

Choosing a combination

Scenario | Good starting set
Structured extraction | field_accuracy + fuzzy_field_match
Lots of free text | Add llm_judge
Tight IDs / codes | Add equals_expected
Unsure | field_accuracy only

Field selection

By default, evaluations score all fields in the JSON output. You can optionally restrict scoring to a subset of fields by passing a fields list of top-level keys or dotted leaf-path prefixes. How matching works:
fields value | Matches
"name" | name, name.first, name.last
"address" | address, address.street, address.city, etc.
"items" | items, items[0], items[0].description, etc.
Field selection applies to Field accuracy, Fuzzy match, and Exact match. The LLM judge always evaluates the entire response regardless of field selection.

API usage — Pass fields in the run or rerun request body:
{
  "skill_id": "<skill-id>",
  "evaluators": ["field_accuracy", "fuzzy_field_match"],
  "fields": ["vendor_name", "total_amount", "line_items"],
  "infer_corrections": true
}
When fields is omitted or null, all fields are evaluated (backward-compatible).
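
Those matching rules amount to a prefix check over flattened leaf paths; a rough illustration (assuming dotted paths with [i] list indices, not the service's exact matcher):

# Illustrative prefix check over flattened leaf paths (e.g. "name.first", "items[0].description").
def leaf_selected(leaf_path, fields):
    if not fields:  # omitted or null: evaluate everything
        return True
    for f in fields:
        if leaf_path == f or leaf_path.startswith(f + ".") or leaf_path.startswith(f + "["):
            return True
    return False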

Interpreting accuracy

Band | Reading | Next step
95–100% | Strong | Watch for regressions after deploys
85–95% | Solid | Tighten weakest fields surfaced in breakdown
70–85% | Mixed | Inspect samples and hints; adjust prompt/schema
< 70% | Risky | Consider Optimize (skills) or deeper instruction changes

Actions on a run

Optimize (skill evaluations only)

Available when source_type is skill and the run completed. The service samples up to 30 response/correction pairs by default (configurable up to 200 in the API), calls the skill optimizer, and creates a new skill version with the same name. You land on the skill detail page when it succeeds.

Rerun (skill evaluations only)

Also skill-only in the UI. Rerun calls the platform rerun pipeline: it optimizes the source skill first (unless you pass a different skill via API), re-executes the underlying requests/executions against the new skill, copies feedback forward, and computes a fresh evaluation. Expect new prediction traffic and background work; results appear on a new run row.
Agent and domain evaluations still give you full metrics and samples, but Optimize and Rerun are not shown because those flows require a skill to re-execute against.

Delete

Removes the run record (typically allowed for completed or failed runs). It does not delete underlying requests, executions, or feedback.

Continuous improvement

With evaluations, you can continuously improve your processes:
  1. Ship
  2. Collect feedback
  3. Evaluate
  4. Optimize / edit skill
  5. Ship again

Programmatic feedback

Evaluations read feedback you already stored; they do not replace the feedback API. For submission formats, entity IDs (request, agent_execution, chat), and examples, use Feedback & fine-tuning and Submit feedback. The dashboard’s run, optimize, and rerun actions call internal evaluation services (not the public api.vlm.run OpenAPI bundle). Treat the dashboard as the supported surface for those operations unless your integration team has exposed the same routes to you.

Best practices

  • Prefer JSON corrections in feedback; use Infer corrections when you only have notes.
  • Use field selection to focus metrics on the fields that matter most, especially during iterative improvement.
  • Match evaluator choice to schema — fuzzy for noisy text fields, judge for semantic leniency, exact for short codes.
  • Read optimization hints on skill runs before clicking Optimize to confirm the weakest fields make sense.
  • Use Rerun as an end-to-end A/B after Optimize, since it reprocesses real inputs rather than only rescoring old JSON.
  • Keep date windows meaningful — very wide windows mix old prompts with new ones and blur trends.

Feedback & fine-tuning

Ground truth tied to requests and executions.

Submit feedback API

HTTP reference for structured feedback.