Testing lab

Validate agents with scenario, regression, approval, and benchmark testing before release.

Testing Lab combines manual test runs, golden dataset checks, review scoring, execution telemetry, and release readiness checks in one bounded environment.

Test cases: 3

Passed: 1

Failed executions: 1

Release-ready agents: 2

agt-001 · pass · Score 92

Passed with one manual review branch.

Correctness and policy compliance stayed above threshold.

Reviewed by Prometa QA · run at 2026-03-15T15:00:00Z

agt-002 · review · Score 81

Escalation triggered correctly, but the relevance score dipped on the routing note.

Review copy needs refinement before publishing.

Reviewed by Claims Reviewer · run at 2026-03-15T16:30:00Z

agt-003 · fail · Score 68

Variance exceeded target for one peak cluster.

Regression breach blocks release readiness.

Reviewed by Planning QA · run at 2026-03-15T18:00:00Z

Evaluation engine

Exact match, semantic similarity, rule validation, and human scoring can run on the same test asset.

Scoring functions

Exact match · semantic similarity · rule-based validation · human review scoring.
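The scoring functions listed above can be sketched as plain functions over an expected and an actual output. This is a minimal illustration, not the Testing Lab implementation: all function names are hypothetical, word-set overlap stands in for a real semantic similarity model, and the equal-weight average is an assumption.

```python
import re


def exact_match(expected: str, actual: str) -> float:
    """Return 1.0 when the normalized strings are identical, else 0.0."""
    return 1.0 if expected.strip().lower() == actual.strip().lower() else 0.0


def token_overlap(expected: str, actual: str) -> float:
    """Jaccard overlap of word sets: a crude stand-in for semantic similarity."""
    a = set(re.findall(r"\w+", expected.lower()))
    b = set(re.findall(r"\w+", actual.lower()))
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def rule_validation(actual: str, required_terms: list[str]) -> float:
    """Fraction of required terms that appear in the output."""
    if not required_terms:
        return 1.0
    text = actual.lower()
    hits = sum(1 for term in required_terms if term.lower() in text)
    return hits / len(required_terms)


def combined_score(expected, actual, required_terms, human_score=None):
    """Average the available scores onto a 0-100 scale.

    human_score is an optional reviewer rating in the range 0.0-1.0.
    Equal weighting is an assumption made for this sketch.
    """
    scores = [
        exact_match(expected, actual),
        token_overlap(expected, actual),
        rule_validation(actual, required_terms),
    ]
    if human_score is not None:
        scores.append(human_score)
    return round(100 * sum(scores) / len(scores))
```

Because every function takes the same expected and actual pair, all four scores can be computed against a single test asset.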

Regression testing

Compare one agent version against another to catch silent behavior drift before release.
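One way to picture a version-to-version comparison is a per-case score diff with a tolerance. The sketch below is an assumption about how such a check could work: the function name, the 2-point tolerance, and the handling of missing cases are all hypothetical.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 2.0,
) -> dict[str, float]:
    """Return {case_id: score_drop} for cases where the candidate version
    scores more than `tolerance` points below the baseline version."""
    regressions = {}
    for case_id, base_score in baseline.items():
        new_score = candidate.get(case_id)
        if new_score is None:
            # Assumption: a case missing from the candidate run counts
            # as a full regression rather than being skipped silently.
            regressions[case_id] = base_score
        elif base_score - new_score > tolerance:
            regressions[case_id] = base_score - new_score
    return regressions
```

An empty result means no case dropped beyond the tolerance. Any entry could be treated as a regression breach that blocks release.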

A/B testing

Run agent_v1 against agent_v2, or one workflow branch against another, on the same dataset.
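A small sketch of an A/B run follows, assuming each agent is a plain callable from input text to output text and each dataset case is a dict with `input` and `expected` keys. The `ab_compare` helper and the case shape are hypothetical.

```python
from statistics import mean


def ab_compare(dataset, agent_a, agent_b, score):
    """Run both agents over the same cases and summarize the scores.

    `score(expected, actual)` returns a number; higher is better.
    """
    scores_a = [score(c["expected"], agent_a(c["input"])) for c in dataset]
    scores_b = [score(c["expected"], agent_b(c["input"])) for c in dataset]
    wins_a = sum(a > b for a, b in zip(scores_a, scores_b))
    wins_b = sum(b > a for a, b in zip(scores_a, scores_b))
    return {
        "mean_a": mean(scores_a),
        "mean_b": mean(scores_b),
        "wins_a": wins_a,
        "wins_b": wins_b,
        "ties": len(dataset) - wins_a - wins_b,
    }
```

Holding the dataset and the scoring function fixed means any difference in the summary comes from the agents themselves.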

Batch runner

The asynchronous evaluation runner supports multiple test cases, multiple versions, and score summaries.
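An asynchronous runner of this kind could be sketched with `asyncio`, assuming each agent version is an async callable. The names `run_batch` and `max_concurrency` and the case shape are hypothetical, not the product's API.

```python
import asyncio


async def run_batch(agents, cases, score, max_concurrency=4):
    """Evaluate every (version, case) pair and return the mean score per version.

    `agents` maps a version label to an async callable: input -> output.
    `cases` is a list of dicts with "id", "input" and "expected" keys.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def evaluate(version, agent, case):
        # The semaphore bounds how many evaluations run at once.
        async with semaphore:
            output = await agent(case["input"])
        return version, score(case["expected"], output)

    tasks = [
        evaluate(version, agent, case)
        for version, agent in agents.items()
        for case in cases
    ]
    results = await asyncio.gather(*tasks)

    per_version = {}
    for version, value in results:
        per_version.setdefault(version, []).append(value)
    return {v: sum(s) / len(s) for v, s in per_version.items()}
```

A caller would start it with `asyncio.run(run_batch(...))`. The returned dict is the score summary, one mean per version.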

Release readiness

Deployment readiness is stated explicitly before an agent is published to pilot or production.

agt-001 · production

Tests passed: yes

Integrations healthy: yes

Governance approved: yes

Rollback path: available

agt-002 · pilot

Tests passed: yes

Integrations healthy: yes

Governance approved: yes

Rollback path: available

agt-003 · test

Tests passed: no

Integrations healthy: yes

Governance approved: no

Rollback path: missing
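The readiness cards above can be read as a four-condition gate. The sketch below encodes that reading. It assumes all four checks must hold for an agent to count as release-ready, and the `ReadinessCheck` class is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ReadinessCheck:
    agent_id: str
    tests_passed: bool
    integrations_healthy: bool
    governance_approved: bool
    rollback_available: bool

    def blockers(self) -> list[str]:
        """Names of the checks that currently fail."""
        checks = {
            "tests passed": self.tests_passed,
            "integrations healthy": self.integrations_healthy,
            "governance approved": self.governance_approved,
            "rollback path": self.rollback_available,
        }
        return [name for name, ok in checks.items() if not ok]

    def is_release_ready(self) -> bool:
        # Assumption: every check must pass; none is optional.
        return not self.blockers()


# Values copied from the three readiness cards above.
AGENTS = [
    ReadinessCheck("agt-001", True, True, True, True),
    ReadinessCheck("agt-002", True, True, True, True),
    ReadinessCheck("agt-003", False, True, False, False),
]
```

Under this reading, agt-001 and agt-002 are release-ready, while agt-003 is blocked on tests, governance approval, and the rollback path.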

Execution trace

Trace-backed evaluation view for regression and approval testing.

Trace `/trace-001` exposes orchestration steps, tool calls, decision points, latency, token cost, and suggested remediation without opening backend internals.

Scenario ingest · success

Scenario pack normalized

Tool calls

Forecast API

Decisions

Peak cluster anomaly detected

Token / cost

1240 in · 318 out · $2.10

Latency breakdown

model 420 ms · tool 980 ms · orchestration 210 ms

Planner review routing · waiting_review

Escalated to planner queue

Tool calls

Planner Review UI

Decisions

Confidence below threshold

Token / cost

220 in · 91 out · $0.60

Latency breakdown

model 180 ms · tool 220 ms · orchestration 140 ms
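To show how the per-step figures roll up, the sketch below totals the two trace steps shown above. The `TraceStep` structure and `summarize` function are hypothetical, and the step list is assumed to be non-empty. Only the numbers come from the trace.

```python
from dataclasses import dataclass


@dataclass
class TraceStep:
    name: str
    status: str
    tokens_in: int
    tokens_out: int
    cost_usd: float
    latency_ms: dict  # keys: "model", "tool", "orchestration"


STEPS = [
    TraceStep("Scenario ingest", "success", 1240, 318, 2.10,
              {"model": 420, "tool": 980, "orchestration": 210}),
    TraceStep("Planner review routing", "waiting_review", 220, 91, 0.60,
              {"model": 180, "tool": 220, "orchestration": 140}),
]


def summarize(steps):
    """Total tokens, cost and latency, plus the slowest latency component."""
    by_component = {}
    for step in steps:
        for component, ms in step.latency_ms.items():
            by_component[component] = by_component.get(component, 0) + ms
    return {
        "tokens_in": sum(s.tokens_in for s in steps),
        "tokens_out": sum(s.tokens_out for s in steps),
        "cost_usd": round(sum(s.cost_usd for s in steps), 2),
        "latency_ms": sum(by_component.values()),
        "slowest_component": max(by_component, key=by_component.get),
    }
```

For these two steps the totals are 1460 tokens in, 409 tokens out, $2.70, and 2150 ms. Tool calls account for the largest share of latency at 1200 ms.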

Explain this execution

Decision path stayed inside registered tools, applied runtime guardrails, and routed to review whenever confidence or policy state required intervention.

Fix suggestion engine

The planner scenario payload is missing one required pricing field. Suggested fix: add schema validation in the input handler and inject a default pricing floor before review.
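The suggested fix might look like the sketch below. The source does not name the missing field, so `pricing_floor`, `scenario_id`, and the default value of 0.0 are all hypothetical placeholders.

```python
# Hypothetical field names and default: the trace does not say which
# pricing field is missing or what the floor should be.
REQUIRED_FIELDS = ("scenario_id", "pricing_floor")
DEFAULTS = {"pricing_floor": 0.0}


def normalize_payload(payload: dict) -> tuple[dict, list[str]]:
    """Validate required fields and inject defaults where one is defined.

    Returns the normalized payload and a list of notes for the reviewer.
    Raises ValueError if a required field is missing and has no default.
    """
    normalized = dict(payload)  # leave the caller's dict untouched
    notes = []
    missing = []
    for field in REQUIRED_FIELDS:
        if normalized.get(field) is None:
            if field in DEFAULTS:
                normalized[field] = DEFAULTS[field]
                notes.append(f"{field} missing; default {DEFAULTS[field]} injected")
            else:
                missing.append(field)
    if missing:
        raise ValueError("missing required fields: " + ", ".join(missing))
    return normalized, notes
```

Returning the notes alongside the payload lets the reviewer see that a default was applied, so the substitution does not pass silently.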