Quality

Evaluation

Controlled review and improvement. Reviewer scoring, golden replays, and gap detection.

Eval cases: 10
Last classification accuracy: no run yet
Last hallucination rate:
Reviewer scores (window): 0

Recent reviewer scores

Last 20 scored runs.

No reviewer scores yet

Open a ticket in the queue and score it from the Reviewer Actions panel.

Knowledge gaps

Recent runs with no relevant KB chunk above threshold.

No gaps detected

Every recent run found at least one relevant KB chunk.
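As a rough illustration of the gap check described above, the sketch below flags runs in which no retrieved KB chunk reaches a relevance threshold. The `Run` record, its `chunk_scores` field, the `find_knowledge_gaps` helper, and the 0.5 threshold are all hypothetical; the actual service may store and score retrieval results differently.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """Hypothetical record of one AI run and its retrieved KB chunk scores."""
    run_id: str
    chunk_scores: list[float] = field(default_factory=list)


def find_knowledge_gaps(runs, threshold=0.5):
    """Return the runs in which no chunk scored at or above the threshold.

    Treating "at or above" as relevant is an assumption; a run with no
    retrieved chunks at all also counts as a gap.
    """
    return [r for r in runs if not any(s >= threshold for s in r.chunk_scores)]


runs = [
    Run("run-1", [0.82, 0.40]),  # one chunk clears the threshold
    Run("run-2", [0.31, 0.12]),  # nothing relevant enough
    Run("run-3", []),            # nothing retrieved
]
gaps = find_knowledge_gaps(runs, threshold=0.5)
print([r.run_id for r in gaps])
```

With this sample data, `run-2` and `run-3` are reported as gaps. An empty result corresponds to the "No gaps detected" state.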

Evaluation runs

Golden replay history.

No evaluation runs

An evaluation runs whenever the prompt or KB version changes. Connect the Python AI service to enable nightly replay.
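One way to decide when a replay is due is to fingerprint the current prompt and KB versions and compare the result with the fingerprint stored at the last evaluation. This is a minimal sketch under that assumption; `version_fingerprint`, `should_run_evaluation`, and the version strings are hypothetical and are not taken from the service itself.

```python
import hashlib


def version_fingerprint(prompt_version: str, kb_version: str) -> str:
    """Hash the two version identifiers into a single comparable string."""
    payload = f"{prompt_version}\n{kb_version}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def should_run_evaluation(last_fingerprint, prompt_version, kb_version) -> bool:
    """Return True when either version differs from the last evaluated pair.

    A last_fingerprint of None means no evaluation has run yet.
    """
    return version_fingerprint(prompt_version, kb_version) != last_fingerprint


# Hypothetical version labels, for illustration only.
fp = version_fingerprint("prompt-v1", "kb-v1")
print(should_run_evaluation(None, "prompt-v1", "kb-v1"))
print(should_run_evaluation(fp, "prompt-v1", "kb-v1"))
print(should_run_evaluation(fp, "prompt-v2", "kb-v1"))
```

A nightly job could call the same check and skip the replay when nothing has changed, or run unconditionally if drift in the underlying model is also a concern.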

Rejection reasons

Why agents reject AI drafts.

No rejections logged

Golden evaluation set

10 cases total. Add more via the API or the KB Manager.

Category           Difficulty   Source     Expected decision  Added
medical_concern    adversarial  synthetic  human_only         about 1 hour ago
complaint          hard         synthetic  human_only         about 1 hour ago
cancellation       hard         synthetic  draft              about 1 hour ago
refund_exception   hard         synthetic  human_only         about 1 hour ago
cancellation       easy         synthetic  draft              about 1 hour ago
legal_threat       adversarial  synthetic  human_only         about 1 hour ago
medical_concern    adversarial  synthetic  human_only         about 1 hour ago
chargeback_threat  adversarial  synthetic  human_only         about 1 hour ago
refund_status      medium       synthetic  draft              about 1 hour ago
order_status       easy         synthetic  auto_send          about 1 hour ago
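A golden replay could score a decision policy against the expected decisions in this set. The sketch below is one possible reading of that metric, not the service's actual implementation: `decision_accuracy` and the `always_human` baseline are hypothetical, and the predictor takes the category as a stand-in for the full ticket that a real replay would pass in.

```python
from collections import Counter

# (category, expected decision) pairs copied from the golden set listing.
GOLDEN_SET = [
    ("medical_concern", "human_only"),
    ("complaint", "human_only"),
    ("cancellation", "draft"),
    ("refund_exception", "human_only"),
    ("cancellation", "draft"),
    ("legal_threat", "human_only"),
    ("medical_concern", "human_only"),
    ("chargeback_threat", "human_only"),
    ("refund_status", "draft"),
    ("order_status", "auto_send"),
]


def decision_accuracy(golden, predict):
    """Fraction of cases where predict(category) matches the expected decision."""
    if not golden:
        return 0.0
    correct = sum(1 for category, expected in golden if predict(category) == expected)
    return correct / len(golden)


def always_human(category):
    """Hypothetical baseline policy: escalate every ticket to a human."""
    return "human_only"


expected_counts = Counter(expected for _, expected in GOLDEN_SET)
print(expected_counts)
print(decision_accuracy(GOLDEN_SET, always_human))
```

Six of the ten cases expect `human_only`, so the escalate-everything baseline scores 0.6. A useful policy has to beat that while still routing every adversarial case to a human.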