Kalman AI

Why evaluation is the new battleground

The first generation of LLM products were judged on whether they worked at all. Could the model answer the question? The bar was function.

Production AI is judged on a different bar. Does it work consistently? Does it fail safely? Can it be audited when something goes wrong? Can it be trusted to make the same decision twice on the same input — and to flag the second-order cases where it shouldn't make a decision at all?

The shift is from can it work to can it be relied on. That shift puts evaluation, quality assurance, and governance at the centre of the engineering effort.

Five layers of a production eval stack

Layer 1 — Custom eval harnesses. Is this specific agent, on this specific data, against this specific decision, actually good enough?

Layer 2 — Online monitoring. Is it still good enough today?

Layer 3 — Deterministic fallback. What happens when it isn't?

Layer 4 — Audit trails. What did it do, and why?

Layer 5 — Red-team and stress testing. What could it do under adversarial or unusual conditions?

The five compound. Skipping any of them produces a system that can pass a demo and fail an audit.

Layer 1: Custom eval harnesses

Public benchmarks measure whether a model can do generic tasks. They do not tell you whether your agent, on your data, against your decisions, is fit for purpose.

A custom eval harness has, at minimum: a representative test set drawn from real workloads, known-good outputs (or a rubric where no single right answer exists), a domain-specific scoring function, and a fixed protocol for running the harness so results are comparable over time.

Two principles separate working harnesses from useless ones. Domain-specific scoring — generic similarity metrics capture surface form, not correctness. Stratified test sets — a single aggregate accuracy number hides regressions that a stratified set surfaces.

A custom eval harness is the single most valuable artefact in an agent project. It is also the thing most projects skip.

Layer 2: Online monitoring

A passing eval harness is a snapshot. The world moves. New inputs, new data, new model releases, and slow distribution shift in user behaviour all degrade an agent's performance without anybody changing the code.

The metrics that matter live: hallucination and groundedness rates, self-consistency, tool-call success rate, distribution shift on input statistics, human override rate, and cost-and-latency at percentile. Online monitoring is dashboards, alerts, and the discipline to look at them.

Layer 3: Deterministic fallback paths

Every production agent should have a non-AI escape hatch. The hatch can be a simple rule, a pre-AI legacy code path kept warm, a queued human task, or a cached prior answer.

Two design decisions matter more than the implementation. The fallback must be on by default — fallback paths that exist in the code but are not wired into the live system are common and useless. Confidence must be explicit — the agent must know and report when it is uncertain so the orchestration layer can decide to fall back.

A working fallback layer turns "the AI failed" from an outage into a degraded mode.

Layer 4: Audit trails

Regulators do not accept "the model decided." They accept "the model was given these inputs, retrieved these documents, called these tools, produced this output, was overridden in this way, and here is the trace."

A working audit trail captures, per invocation: the initiator, the redacted input, retrieval results, every model call (model, version, prompt, response, tokens, latency), every tool call, the output and reported confidence, the routing decisions, and any human override.

Two engineering choices make the difference. Trace by design, not by extraction — the trace is written as the system runs, not reconstructed from logs. Queryable storage, not append-only logs — a trace you cannot search at the level of "every decision in the last 30 days where the agent overrode a deterministic rule" is a trace that won't be looked at.

Layer 5: Red-team and stress testing

Red-teaming finds the failure modes you didn't anticipate: prompt injection (direct and indirect), distribution shift, tool misuse, long-tail combinations, confidentiality probes.

Red-teaming is most useful when it produces concrete artefacts: regression tests added to the eval harness, defensive rules added to the orchestration layer, alerts added to the monitoring stack. Findings that don't make it back into the system don't compound.

Governance — the policy half

A workable AI governance posture includes a clear definition of what the agent is allowed to decide, a change-management process tied to the eval harness, a defined incident-response process, and a risk classification that matches the depth of governance to the stakes.

Governance does not have to be heavy. It does have to be explicit. Implicit governance is what produces a 6 a.m. incident the team didn't see coming.

A useful mental model

Think of the eval and governance layer as the control plane for an AI system, where the agents and models are the data plane.

The data plane runs the workload. The control plane decides what's allowed, watches what happens, catches what fails, records what was done, and sets the policies the data plane follows.

In mature production engineering, separating the control plane from the data plane is one of the architectural patterns that turn fragile systems into operable ones. AI systems benefit from the same discipline. Most don't have it.

The honest summary

If an agent is going to make decisions inside a clinic, a bank, a public utility, or a defence platform, the eval and governance layer is what stands between useful AI system and regulatory incident. Skipping it is not a shortcut; it is a deferred liability.

Evaluation, Quality and Governance — The Layer That Decides Whether Your AI Survives Production