Evals before launch: testing agent behaviour like code

Most AI projects die as demos because nobody can answer a simple question: how do you know it still works? The model changed. The prompt changed. A vendor changed an API. In normal software, tests answer that question. In AI systems, evals do – and most teams ship without them.

An eval is not complicated. It is a fixed set of inputs with known-correct answers, a scoring method, and a bar. Before a release, the system runs the set; if the score drops below the bar, the deploy stops. It is the same discipline software engineering settled on decades ago, applied to agent behaviour.
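That definition fits in a few lines of code. The following is a minimal sketch, not our production harness: the names `EvalCase`, `run_eval`, and `gate`, the toy classifier, and the 0.9 bar are all illustrative assumptions, and scoring here is plain exact match.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One fixed input paired with its known-correct answer."""
    input: str
    expected: str


def run_eval(system: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Score the system as the fraction of cases answered exactly right."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(1 for case in cases if system(case.input) == case.expected)
    return passed / len(cases)


def gate(score: float, bar: float) -> bool:
    """Return True if the deploy may proceed, False if it must stop."""
    return score >= bar


# Hypothetical system under test: a toy classifier standing in for the agent.
def toy_system(text: str) -> str:
    return "invoice" if "invoice" in text.lower() else "other"


# Hypothetical eval set; a real one would be versioned alongside the code.
CASES = [
    EvalCase("Invoice 1001 attached", "invoice"),
    EvalCase("Lunch on Friday?", "other"),
    EvalCase("Statement for March", "statement"),
    EvalCase("INVOICE overdue", "invoice"),
]

score = run_eval(toy_system, CASES)
deploy_allowed = gate(score, bar=0.9)  # assumed bar, for illustration only
print(f"score={score:.2f} deploy_allowed={deploy_allowed}")
```

In a CI pipeline the same check would typically end with a non-zero exit code when `deploy_allowed` is false, so the release step never runs.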

What we run on a financial orchestration system

Our reconciliation system for an enterprise IT distributor processes vendor invoices, statements, and purchase orders into an ERP. Every release is gated on four checks:

Golden-set reconciliation – a fixed set of invoice, statement, and order triples with known-correct matches. The agent’s matches are scored against ground truth; a regression blocks the deploy.
Document classification against a labelled corpus – invoices, statements, purchase orders, general correspondence – with per-class precision and recall checked against a minimum bar.
Field extraction – known values for amount, date, and vendor on a sample set, compared with what the agent extracted.
Escalation correctness – synthetic exception cases that should reach a human. The test confirms the agent escalates instead of acting alone.
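As an illustration of the classification check, here is a minimal sketch of per-class precision and recall compared against a minimum bar. The labels, the sample predictions, and the 0.9 thresholds are assumptions made for the example, not figures from the production suite.

```python
from collections import Counter


def per_class_precision_recall(
    gold: list[str], predicted: list[str]
) -> dict[str, tuple[float, float]]:
    """Return {label: (precision, recall)} for every label seen."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[g] += 1  # gold label g was missed
    metrics = {}
    for label in sorted(set(gold) | set(predicted)):
        predicted_n = tp[label] + fp[label]
        gold_n = tp[label] + fn[label]
        precision = tp[label] / predicted_n if predicted_n else 0.0
        recall = tp[label] / gold_n if gold_n else 0.0
        metrics[label] = (precision, recall)
    return metrics


def classes_below_bar(
    metrics: dict[str, tuple[float, float]],
    min_precision: float,
    min_recall: float,
) -> list[str]:
    """List every class whose precision or recall misses the bar."""
    return [
        label
        for label, (precision, recall) in metrics.items()
        if precision < min_precision or recall < min_recall
    ]


# Hypothetical labelled corpus and agent output.
GOLD = ["invoice", "invoice", "statement", "purchase_order",
        "correspondence", "statement"]
PRED = ["invoice", "statement", "statement", "purchase_order",
        "correspondence", "statement"]

metrics = per_class_precision_recall(GOLD, PRED)
failing = classes_below_bar(metrics, min_precision=0.9, min_recall=0.9)
print(failing)  # a non-empty list would block the deploy
```

Checking per class rather than overall accuracy matters here: one invoice misread as a statement barely moves the aggregate, yet it drags down invoice recall and statement precision, and both classes fail the bar.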

What we run on a real-time voice agent

A voice agent is harder to test than a pipeline – conversations branch. Our 13-state voice system is built so they branch predictably, and the eval suite asserts exactly that:

State-path evals – scripted caller transcripts that must drive a specific route through the states and land in the correct terminal state: booked, declined, callback, or escalated.
Disclosure gate – every call in the corpus must pass through the disclosure state before qualification begins. A call that skips it fails the run. This mirrors the EU AI Act transparency obligations that apply from August 2026: callers must be told they are talking to an AI system.
Booking correctness – given a caller intent to book, the test verifies the right calendar action and CRM record were produced.
Latency budget – time-to-first-response measured across the corpus. Real-time voice fails if it lags, so the budget is part of the suite, not a hope.
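The state-path and disclosure checks can be sketched as assertions over the sequence of states a scripted call visited. This is a simplified illustration: only the four terminal states come from the description above, while the other state names, the call IDs, and the shape of the recorded path are assumptions.

```python
# Terminal states named in the text; all other state names are hypothetical.
TERMINAL_STATES = {"booked", "declined", "callback", "escalated"}


def check_call(path: list[str], expected_terminal: str) -> list[str]:
    """Return the list of failures for one call's recorded state path."""
    failures = []
    if not path or path[-1] not in TERMINAL_STATES:
        failures.append("call did not end in a terminal state")
    elif path[-1] != expected_terminal:
        failures.append(
            f"ended in {path[-1]!r}, expected {expected_terminal!r}"
        )
    # Disclosure gate: disclosure must happen, and before qualification.
    if "disclosure" not in path:
        failures.append("call skipped the disclosure state")
    elif ("qualification" in path
          and path.index("qualification") < path.index("disclosure")):
        failures.append("qualification began before disclosure")
    return failures


def run_corpus(
    corpus: dict[str, tuple[list[str], str]]
) -> dict[str, list[str]]:
    """Check every call; return only the calls that failed."""
    results = {}
    for call_id, (path, expected) in corpus.items():
        failures = check_call(path, expected)
        if failures:
            results[call_id] = failures
    return results


# Hypothetical recorded paths from replaying scripted caller transcripts.
CORPUS = {
    "call-good": (
        ["greeting", "disclosure", "qualification", "scheduling", "booked"],
        "booked",
    ),
    "call-late-disclosure": (
        ["greeting", "qualification", "disclosure", "scheduling", "booked"],
        "booked",
    ),
    "call-wrong-end": (
        ["greeting", "disclosure", "qualification", "declined"],
        "booked",
    ),
}

failed_calls = run_corpus(CORPUS)
print(failed_calls)  # any entry here fails the run
```

Asserting on the path rather than on the transcript text keeps the test stable when wording changes: the model may phrase the disclosure differently, but the call must still pass through that state.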

What we run on a 30-agent operations platform

On a platform running 30+ agents on local models – inbox triage, KPI reporting, monitoring – the failure modes are different: silent drift, duplicated side effects, hallucinated numbers. The suite targets each:

Inbox-triage scoring against a labelled email set – route target and priority – before any change ships.
KPI grounding – figures in a generated report are recomputed independently from the source database. Any mismatch fails the run. This is the check that catches hallucinated numbers.
Idempotency – the same input replayed twice must produce no duplicate side effects. No double-send, no double-write.
Alerting – a failure is injected; the test confirms an alert fires and reaches the escalation path.
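Two of these checks are easy to show in miniature. The sketch below uses an in-memory SQLite table as a stand-in for the source database and a small in-process outbox as a stand-in for real side effects; the table, the column names, the KPI names, and both handlers are hypothetical.

```python
import sqlite3


def recompute_kpis(conn: sqlite3.Connection) -> dict[str, int]:
    """Recompute the KPIs from source data, independent of the agent."""
    count, revenue = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(amount_cents), 0) FROM orders"
    ).fetchone()
    return {"order_count": count, "revenue_cents": revenue}


def kpi_mismatches(reported: dict[str, int],
                   recomputed: dict[str, int]) -> list[str]:
    """Names of KPIs whose reported value differs from the recomputed one."""
    return sorted(k for k in recomputed if reported.get(k) != recomputed[k])


class Outbox:
    """Records side effects; send_once ignores a repeated message id."""

    def __init__(self) -> None:
        self.sent: list[str] = []
        self._seen: set[str] = set()

    def send_once(self, message_id: str, body: str) -> bool:
        if message_id in self._seen:
            return False
        self._seen.add(message_id)
        self.sent.append(body)
        return True


def keyed_handler(outbox: Outbox, email: dict) -> None:
    """Hypothetical agent action that deduplicates on the email id."""
    outbox.send_once(email["id"], f"ack:{email['id']}")


def naive_handler(outbox: Outbox, email: dict) -> None:
    """Hypothetical agent action with no deduplication."""
    outbox.sent.append(f"ack:{email['id']}")


def replay_is_idempotent(handler, event: dict) -> bool:
    """Run the handler twice on one event; side effects must not grow."""
    outbox = Outbox()
    handler(outbox, event)
    after_first = list(outbox.sent)
    handler(outbox, event)
    return outbox.sent == after_first


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount_cents INTEGER)")
conn.executemany("INSERT INTO orders (amount_cents) VALUES (?)",
                 [(1200,), (3450,), (999,)])

truth = recompute_kpis(conn)
grounded = kpi_mismatches({"order_count": 3, "revenue_cents": 5649}, truth)
hallucinated = kpi_mismatches({"order_count": 3, "revenue_cents": 5700}, truth)
print(grounded, hallucinated)
```

Storing amounts as integer cents keeps the comparison exact; with floating-point totals the check would need an explicit tolerance.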

Why this matters more in 2026

Models now ship fast enough that a system tuned in spring can behave differently by autumn. Without evals, every model release is a risk you absorb blind. With them, it is routine maintenance: re-run the suite against the new model, compare, upgrade when the results clear the bar. The thing that scares teams about AI – constant change underneath – becomes the mechanism that keeps the system improving.
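The re-run-and-compare step might look like the sketch below. The check names, the scores, and the bars are invented for illustration, and the upgrade rule shown (every check must clear its bar) is one reasonable reading of the process rather than a fixed policy.

```python
def compare_runs(
    current: dict[str, float],
    candidate: dict[str, float],
    bars: dict[str, float],
) -> dict[str, dict]:
    """Per check: change versus the current model, and whether the bar is met."""
    report = {}
    for check, bar in bars.items():
        new_score = candidate[check]
        report[check] = {
            "delta": round(new_score - current[check], 4),
            "clears_bar": new_score >= bar,
        }
    return report


def should_upgrade(report: dict[str, dict]) -> bool:
    """Upgrade only if the candidate clears the bar on every check."""
    return all(entry["clears_bar"] for entry in report.values())


# Hypothetical bars and suite results for two model versions.
BARS = {"golden_set": 0.98, "classification": 0.95, "escalation": 1.0}
CURRENT = {"golden_set": 0.99, "classification": 0.96, "escalation": 1.0}
CANDIDATE = {"golden_set": 0.99, "classification": 0.97, "escalation": 0.95}

report = compare_runs(CURRENT, CANDIDATE, BARS)
upgrade = should_upgrade(report)
print(report, upgrade)
```

In this made-up run the candidate improves classification but misses the escalation bar, so the upgrade waits. That is the point of gating per check: a better average cannot hide a regression on the check that matters most.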

Evals are also the audit trail. When an agent touches finance data or speaks to your customers, "it seemed fine in the demo" is not an answer anyone can take to a board, an accountant, or a regulator. A scored suite with history is.

What this means if you work with us

Evaluation criteria are agreed before anything goes live – they are written into the scope, and acceptance is gated on them. Systems on an operations plan get their suite re-run as models ship. And everything is handed over: the eval sets, the scoring, the history. Your system, your proof.

Want your system gated on evals before it goes live? Tell us what you are trying to automate.

