Quality

Evaluation

Controlled review and improvement. Reviewer scoring, golden replays, and gap detection.

Eval cases

Last classification accuracy

—

no run yet

Last hallucination rate

—

Reviewer scores (window)

Last 20 scored runs.

Open a ticket in the queue and score it from the Reviewer Actions panel.

Recent runs in which no retrieved KB chunk scored above the relevance threshold.

Every recent run found at least one relevant KB chunk.
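
As a rough illustration of the gap check described above, the sketch below flags runs in which no retrieved KB chunk scores above a threshold. The `Run` record, the `find_kb_gaps` helper, and the 0.5 default threshold are all hypothetical; the actual threshold and data model are not stated on this page.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """Hypothetical record of one AI run and the relevance scores of its retrieved KB chunks."""
    run_id: str
    chunk_scores: list[float] = field(default_factory=list)


def find_kb_gaps(runs: list[Run], threshold: float = 0.5) -> list[Run]:
    """Return the runs where no retrieved chunk scored above the threshold.

    A run with no retrieved chunks at all also counts as a gap.
    The 0.5 default is an assumed value for illustration only.
    """
    return [r for r in runs if not any(s > threshold for s in r.chunk_scores)]


runs = [
    Run("r1", [0.82, 0.40]),  # one chunk clears the threshold
    Run("r2", [0.31, 0.12]),  # nothing clears the threshold
    Run("r3", []),            # nothing was retrieved
]
gaps = find_kb_gaps(runs)
print([r.run_id for r in gaps])  # prints ['r2', 'r3']
```

When `find_kb_gaps` returns an empty list, the panel would show the "every recent run found at least one relevant KB chunk" state.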

Golden replay history.

An evaluation runs on every prompt or KB version change. Wire the Python AI service to enable nightly replay.
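
One way the "run on every prompt or KB version change" behavior could work is to remember the last version pair that was evaluated and replay only when it differs. This is a minimal sketch under that assumption; `ReplayTrigger`, `on_versions`, and the version strings are hypothetical names, not part of the product's API.

```python
from typing import Callable, Optional, Tuple


class ReplayTrigger:
    """Runs a golden replay only when the prompt or KB version changes (illustrative)."""

    def __init__(self, run_replay: Callable[[str, str], None]) -> None:
        self._run_replay = run_replay
        self._last_versions: Optional[Tuple[str, str]] = None

    def on_versions(self, prompt_version: str, kb_version: str) -> bool:
        """Replay if this version pair differs from the last evaluated one.

        Returns True when a replay was started, False when it was skipped.
        """
        versions = (prompt_version, kb_version)
        if versions == self._last_versions:
            return False
        self._run_replay(prompt_version, kb_version)
        self._last_versions = versions
        return True


calls = []
trigger = ReplayTrigger(lambda p, k: calls.append((p, k)))
trigger.on_versions("prompt-v1", "kb-v1")  # first sighting: replay runs
trigger.on_versions("prompt-v1", "kb-v1")  # unchanged: skipped
trigger.on_versions("prompt-v1", "kb-v2")  # KB changed: replay runs
print(calls)
```

A nightly replay would call the same `run_replay` callable on a schedule, independent of version changes.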

Why agents reject AI drafts.

10 cases total. Add more via the API or the KB Manager.

Category	Difficulty	Source	Expected decision	Added
medical_concern	adversarial	synthetic	human_only	about 1 hour ago
complaint	hard	synthetic	human_only	about 1 hour ago
cancellation	hard	synthetic	draft	about 1 hour ago
refund_exception	hard	synthetic	human_only	about 1 hour ago
cancellation	easy	synthetic	draft	about 1 hour ago
legal_threat	adversarial	synthetic	human_only	about 1 hour ago
medical_concern	adversarial	synthetic	human_only	about 1 hour ago
chargeback_threat	adversarial	synthetic	human_only	about 1 hour ago
refund_status	medium	synthetic	draft	about 1 hour ago
order_status	easy	synthetic	auto_send	about 1 hour ago
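
To make the table concrete, the sketch below loads the ten cases above and scores a decision function against their expected decisions. It assumes accuracy means the fraction of cases whose predicted decision matches the expected one, which may differ from how the dashboard computes "classification accuracy". `EvalCase`, `decision_accuracy`, and `always_human_only` are hypothetical names used for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    category: str
    difficulty: str
    source: str
    expected_decision: str


# The ten rows from the table above (category, difficulty, source, expected decision).
CASES = [
    EvalCase("medical_concern", "adversarial", "synthetic", "human_only"),
    EvalCase("complaint", "hard", "synthetic", "human_only"),
    EvalCase("cancellation", "hard", "synthetic", "draft"),
    EvalCase("refund_exception", "hard", "synthetic", "human_only"),
    EvalCase("cancellation", "easy", "synthetic", "draft"),
    EvalCase("legal_threat", "adversarial", "synthetic", "human_only"),
    EvalCase("medical_concern", "adversarial", "synthetic", "human_only"),
    EvalCase("chargeback_threat", "adversarial", "synthetic", "human_only"),
    EvalCase("refund_status", "medium", "synthetic", "draft"),
    EvalCase("order_status", "easy", "synthetic", "auto_send"),
]


def decision_accuracy(cases: list[EvalCase], decide: Callable[[EvalCase], str]) -> float:
    """Fraction of cases where decide(case) equals the expected decision."""
    if not cases:
        raise ValueError("no eval cases to score")
    hits = sum(decide(c) == c.expected_decision for c in cases)
    return hits / len(cases)


def always_human_only(case: EvalCase) -> str:
    """Trivial baseline (not the real model): escalate everything to a human."""
    return "human_only"


print(decision_accuracy(CASES, always_human_only))  # prints 0.6
```

Because six of the ten cases expect `human_only`, a baseline that always escalates already matches 60% of them, so any real decision function should be compared against that figure rather than against zero.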