AI · Risk
Evals are a risk control. Most AI deployments are missing them.
If you cannot quantify the failure rate of a deployed AI system, you cannot say what its residual risk is — and the regulators are starting to ask. Evals are the discipline that closes the gap, and they are simpler to start than the literature implies.
For every other system in production, the residual risk question has a quantitative answer — failure rate per million transactions, mean time between failures, percent of correct decisions on a held-out dataset. For AI systems, the answer is usually qualitative — “we have done significant testing and have not seen issues” — an answer regulators are increasingly unwilling to accept. Evals close that gap, and a working program takes far less effort to stand up than the published literature suggests.
An eval, in operational terms, is a test set with a measurable pass criterion that you run on the deployed system regularly. The test set represents the operational reality. The pass criterion represents the acceptable failure rate. The cadence represents the assurance frequency. Three things; the rest is implementation.
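As a sketch, those three parts reduce to a small data shape plus one loop. The names below (`EvalCase`, `EvalSpec`, `run_eval`) are illustrative, not a standard library:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    """One case: an input the deployed system will see, and the expected outcome."""
    input: str
    expected: str

@dataclass
class EvalSpec:
    """The three operational parts: test set, pass criterion, cadence."""
    cases: list[EvalCase]   # the test set: represents operational reality
    min_pass_rate: float    # the pass criterion: acceptable failure rate, e.g. 0.98
    cadence: str            # the assurance frequency, e.g. "weekly"

def run_eval(spec: EvalSpec, system: Callable[[str], str]) -> bool:
    """Run every case through the deployed system and apply the pass criterion."""
    passed = sum(1 for case in spec.cases if system(case.input) == case.expected)
    rate = passed / len(spec.cases)
    print(f"pass rate {rate:.1%} against threshold {spec.min_pass_rate:.0%}")
    return rate >= spec.min_pass_rate
```

The `system` argument is whatever callable fronts the deployed system, which keeps the harness independent of the model behind it.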
The reason evals do not exist in most deployments is that the conversation gets stuck in the wrong place. Engineering teams hear “evals” and think benchmarks — MMLU, HumanEval, MT-Bench. Risk teams hear “evals” and think red-teaming exercises. Neither is the right starting point for a deployed business system. The right starting point is the answer to one question: in the use case this system is deployed for, what does failure look like, and how often is it acceptable?
Worked examples from recent engagements:
A claims-triage AI for a general insurer. Failure looks like routing a non-trivial claim to the auto-process queue, or a trivial claim to the human queue. The eval set is two hundred historical claims with the correct routing decision, sampled across claim types and complexity. The pass criterion is 98% correct on auto-process (false negatives are expensive) and 90% correct on human-queue (false positives create cost but not customer harm). The cadence is weekly, with the result going to the operations dashboard. A scoring sketch for this pass criterion follows the three examples.
An internal contract-review agent for a corporate legal team. Failure looks like missing a non-standard clause that should have been flagged. The eval set is one hundred contracts annotated by senior counsel for known issues, half real, half synthetic with known plants. The pass criterion is 95% recall on flagged clauses, with precision tracked as a secondary metric (low precision is annoying, low recall is dangerous). The cadence is monthly, with the result feeding the model selection conversation.
A customer-support chatbot at a regional bank. Failure looks like giving incorrect information that creates customer detriment, or escalating things that did not need escalation. The eval set is three hundred conversation transcripts with the correct outcome scored by a senior customer service manager. The pass criterion is two-axis — accuracy on factual claims (must be 99%+) and appropriate escalation (must be within band). The cadence is daily because the underlying model gets updated.
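To make the first example concrete, here is a minimal sketch of its two-threshold pass criterion, assuming each scored case records the correct routing decision and the system's actual one; the field names and `score_triage` helper are illustrative, not from any engagement:

```python
# Two-threshold pass criterion from the claims-triage example: misrouting into
# auto-process is the expensive failure, so that route gets the stricter bar.
THRESHOLDS = {"auto_process": 0.98, "human_queue": 0.90}

def score_triage(cases: list[dict]) -> dict[str, bool]:
    """cases: [{"expected": "auto_process", "actual": "human_queue"}, ...]"""
    results = {}
    for route, threshold in THRESHOLDS.items():
        in_route = [c for c in cases if c["expected"] == route]
        correct = sum(1 for c in in_route if c["actual"] == route)
        rate = correct / len(in_route) if in_route else 0.0
        print(f"{route}: {rate:.1%} correct (threshold {threshold:.0%})")
        results[route] = rate >= threshold
    return results
```

The same shape covers the contract-review example by swapping per-route accuracy for recall on flagged clauses, with precision logged as the secondary metric.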
The gap between what these look like and what the published eval literature looks like is roughly the difference between a software unit test and a research benchmark. The unit test is about the specific system in production. The benchmark is about general capability. Production deployments need the unit test version. Build that first.
The framework alignment is worth noting. NIST AI RMF’s Measure function reads almost as a set of instructions for an evals program — establish metrics, baseline performance, monitor drift, track aggregate impact. ISO 42001’s clause 9 (performance evaluation) maps just as directly. APRA’s expectations under CPS 230 for operational risk include monitoring the performance of material services, and an AI system in production is in scope if it forms part of one. The auditor question an evals program answers is “how do you know this system is performing as intended?” The answer “we run an eval set weekly and the pass rate is X” ends the conversation; the answer “we have done significant testing during development” opens it.
The investment to start is small. A useful eval set is dozens to hundreds of cases, not thousands. The harness can be a Python script in a CI job; the reporting, a single row on the existing operations dashboard. The hardest part is curating the test set and choosing the pass criterion, both of which require domain judgment rather than ML expertise. For mid-market deployments, the entire stand-up is achievable in a few weeks of focused effort. For most CISOs we work with, having a quantitative answer the next time the board asks how the AI is performing in production justifies the work on its own.
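A sketch of that CI-job shape, assuming the eval set is checked in as a JSON file and `query_system` is wired to the deployed endpoint — both assumptions, not prescriptions. The non-zero exit code is what turns a failing eval into a failing pipeline:

```python
import json
import sys

MIN_PASS_RATE = 0.98  # the agreed pass criterion, not a tooling default

def query_system(input_text: str) -> str:
    """Call the deployed system under test. Stub: wire to the production endpoint."""
    raise NotImplementedError

def main() -> None:
    with open("eval_set.json") as f:
        cases = json.load(f)  # [{"input": "...", "expected": "..."}, ...]
    passed = sum(1 for c in cases if query_system(c["input"]) == c["expected"])
    rate = passed / len(cases)
    print(f"eval pass rate: {rate:.1%} on {len(cases)} cases")
    sys.exit(0 if rate >= MIN_PASS_RATE else 1)  # failing eval fails the build

if __name__ == "__main__":
    main()
```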