Evaluation
Controlled review and improvement. Reviewer scoring, golden replays, and gap detection.
Eval cases: 10
Last classification accuracy: not available (no run yet)
Last hallucination rate: not available
Reviewer scores (window): 0
Recent reviewer scores
Last 20 scored runs.
No reviewer scores yet
Open a ticket in the queue and score it from the Reviewer Actions panel.
Knowledge gaps
Recent runs with no relevant KB chunk above threshold.
No gaps detected
Every recent run found at least one relevant KB chunk.
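As a rough illustration of the gap rule described above, the sketch below flags a run as a knowledge gap when none of its retrieved KB chunks reaches a relevance threshold. The `Run` shape, the `chunk_scores` field, the function name, and the `0.75` threshold are all assumptions made for this example; the dashboard does not state the actual data model or threshold value.

```python
from dataclasses import dataclass, field

# Hypothetical threshold; the real value is not shown on this page.
RELEVANCE_THRESHOLD = 0.75


@dataclass
class Run:
    """Assumed shape of a run: an id plus relevance scores of retrieved KB chunks."""
    run_id: str
    chunk_scores: list[float] = field(default_factory=list)


def find_knowledge_gaps(runs, threshold=RELEVANCE_THRESHOLD):
    """Return the runs in which no chunk scored at or above the threshold.

    Treating the threshold as inclusive is an assumption. A run that
    retrieved no chunks at all also counts as a gap.
    """
    return [r for r in runs if not any(s >= threshold for s in r.chunk_scores)]


runs = [
    Run("run-1", [0.91, 0.40]),  # one relevant chunk, so not a gap
    Run("run-2", [0.52, 0.31]),  # all chunks below threshold
    Run("run-3", []),            # nothing retrieved
]
gap_ids = [r.run_id for r in find_knowledge_gaps(runs)]
print(gap_ids)
```

Under this rule an empty result corresponds to the "No gaps detected" state shown above.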
Evaluation runs
Golden replay history.
No evaluation runs
An evaluation runs whenever the prompt or KB version changes. Connect the Python AI service to enable nightly replays.
Rejection reasons
Why agents reject AI drafts.
No rejections logged
Golden evaluation set
10 cases total. Add more via the API or the KB Manager.
| Category | Difficulty | Source | Expected decision | Added |
|---|---|---|---|---|
| medical_concern | adversarial | synthetic | human_only | about 1 hour ago |
| complaint | hard | synthetic | human_only | about 1 hour ago |
| cancellation | hard | synthetic | draft | about 1 hour ago |
| refund_exception | hard | synthetic | human_only | about 1 hour ago |
| cancellation | easy | synthetic | draft | about 1 hour ago |
| legal_threat | adversarial | synthetic | human_only | about 1 hour ago |
| medical_concern | adversarial | synthetic | human_only | about 1 hour ago |
| chargeback_threat | adversarial | synthetic | human_only | about 1 hour ago |
| refund_status | medium | synthetic | draft | about 1 hour ago |
| order_status | easy | synthetic | auto_send | about 1 hour ago |
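To make the idea of a golden replay concrete, here is a minimal sketch that scores a decision function against the ten cases in the table. The `GoldenCase` type, the `replay` function, and the `toy_decide` policy are hypothetical stand-ins, not the service's actual API. The metric is simple agreement with the "Expected decision" column, which may differ from how the dashboard computes classification accuracy.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenCase:
    category: str
    difficulty: str
    expected_decision: str  # "human_only", "draft", or "auto_send"


# The ten rows of the golden evaluation set, in table order.
GOLDEN_SET = [
    GoldenCase("medical_concern", "adversarial", "human_only"),
    GoldenCase("complaint", "hard", "human_only"),
    GoldenCase("cancellation", "hard", "draft"),
    GoldenCase("refund_exception", "hard", "human_only"),
    GoldenCase("cancellation", "easy", "draft"),
    GoldenCase("legal_threat", "adversarial", "human_only"),
    GoldenCase("medical_concern", "adversarial", "human_only"),
    GoldenCase("chargeback_threat", "adversarial", "human_only"),
    GoldenCase("refund_status", "medium", "draft"),
    GoldenCase("order_status", "easy", "auto_send"),
]


def replay(cases: list[GoldenCase], decide: Callable[[GoldenCase], str]) -> float:
    """Return the fraction of cases where decide() matches the expected decision."""
    if not cases:
        return 0.0
    hits = sum(1 for c in cases if decide(c) == c.expected_decision)
    return hits / len(cases)


# Hypothetical rule-based policy used only to exercise replay().
ESCALATE = {"medical_concern", "complaint", "refund_exception",
            "legal_threat", "chargeback_threat"}


def toy_decide(case: GoldenCase) -> str:
    if case.category in ESCALATE:
        return "human_only"
    if case.category == "order_status":
        return "auto_send"
    return "draft"


accuracy = replay(GOLDEN_SET, toy_decide)
always_draft = replay(GOLDEN_SET, lambda c: "draft")
print(f"toy policy: {accuracy:.0%}, always draft: {always_draft:.0%}")
```

A policy that always drafts matches only the three `draft` rows. The six `human_only` cases dominate this set, so a replay like this mostly tests whether escalation is handled correctly.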