How We Test AI Agents Before Shipping Them to Clients
Inside our four-layer QA process for AI agents. How we catch hallucinations, tool errors, and edge cases before agents ever touch production data.
Roughly 1 in 4 AI agents we audit from other vendors fail a basic test on day one. They call the wrong tool, hallucinate a customer ID, or quietly skip a step when the input looks unusual. None of those failures show up in a demo. They show up the first time the agent meets real production data, often in front of a real customer. So before any agent we build leaves our studio, it goes through a four-layer test that we treat as non-negotiable.
This post walks through that process. It is the same pipeline we use whether the agent is answering support tickets, processing invoices, or routing leads. The names of the layers will sound familiar to anyone who has shipped traditional software. The mechanics are different.
Why testing AI agents is different
Traditional software is deterministic. Same input, same output, every time. AI agents are probabilistic. The same input can produce three slightly different outputs, and one of them can be wrong in a way the other two are not. That changes what testing has to do.
You are no longer just checking whether the code runs. You are checking whether the model reasons correctly under pressure, whether it picks the right tool, whether it gives up when it should, and whether it stays inside the lane you defined. Most teams treat agents like APIs and skip this step. That is the single biggest reason AI projects stall in pilot.
We learned this the hard way on early builds. An agent that scored 95% on our internal benchmark would still fail on a real customer message that contained a typo, a date in the wrong format, or an attachment we had not anticipated. We needed a structure that caught those gaps before clients did.
The four layers we run on every agent
Every agent we ship goes through the same four layers, in order. If a layer fails, we fix it before moving to the next one. No exceptions.
Layer 1: Unit tests for tool calls
The first layer treats the agent like any other function. We write deterministic tests for every tool the agent can call. If the agent has a lookup_customer tool, we test that it returns the right schema, handles missing IDs, and fails loudly on bad input.
This sounds basic, and it is. It is also the layer most teams skip because it feels too simple to matter. We have caught tool-level bugs in production agents three times in the last quarter. Twice it was a date format mismatch. Once it was a tool that silently truncated a list when the response was over 50 items.
If the tools are not solid, nothing the model does on top of them matters.
Layer 2: Behavioral evals
The second layer is where we test the agent itself. We build a dataset of input-output pairs that represent the full range of behavior we expect, including the messy edge cases. Then we run the agent against that dataset on every change to the prompt, the model, or the tools.
Each row in our eval set has a clear pass criterion. Sometimes that is exact text match. More often it is a rubric: did the agent call the correct tool, in the correct order, with the correct arguments. We grade those automatically using a separate evaluator model with a strict prompt.
A few things we have learned about behavioral evals:
- Start with 50 to 100 examples. Anything less is noise. Anything more is hard to maintain.
- Include the failure modes you have already seen in production. Those are the regressions you cannot afford.
- Track scores over time. A drop from 94% to 89% on a prompt change is a signal, not a rounding error.
Layer 3: Adversarial probes
The third layer is where we try to break the agent. We feed it inputs designed to confuse it: contradictory instructions, prompt injection attempts, requests outside its scope, deliberately ambiguous data, and edge cases we know other agents have failed on.
The goal is not to catch every possible misuse. The goal is to find the classes of failure the agent is vulnerable to. If the agent leaks system instructions when a user politely asks it to repeat them, that is a class of failure we fix once and test for forever. If the agent answers questions it should refuse, same thing.
We keep a running library of adversarial prompts that grows with every project. New agents inherit the whole library on day one, which means they ship more secure than the first ones we built two years ago.
Layer 4: Shadow mode in production
The final layer is the one we trust most. Before an agent goes live, it runs in shadow mode against real production traffic. It sees real inputs, generates real outputs, and logs everything, but its responses never reach the user. A human reviews a sample of the outputs daily.
Shadow mode usually runs for one to two weeks depending on volume. We are looking for the things our offline tests cannot catch: data we did not expect, formats we did not plan for, and patterns of usage that only show up at scale.
Around 30% of agents need at least one prompt or tool change during shadow mode. That is not a sign the earlier layers failed. It is a sign that production data always surprises you, and that catching the surprise before customers do is the whole point.
The metrics we track
Across all four layers, we track the same five metrics for every agent:
- Task completion rate. Did the agent finish the task it was given.
- Tool accuracy. Did it call the right tool with the right arguments.
- Refusal accuracy. Did it correctly refuse out-of-scope requests.
- Latency. How long did it take from input to final output.
- Cost per task. How much did this run cost in tokens.
The first three tell us if the agent works. The last two tell us if it works at a price the client can afford to scale. An agent that gets the right answer 99% of the time but costs $4 per task is not shippable for most use cases. We measure both from day one.
Why most teams skip this
The honest answer is that testing agents looks expensive. A proper eval set takes time to build. Shadow mode delays go-live by a week or two. Adversarial probes feel like work for a problem that has not happened yet.
The cost of skipping it is much higher. We have inherited agents from other vendors that failed silently in production for months because nobody had written tests for the unhappy path. By the time the client noticed, the trust was gone. Testing is the cheapest insurance an AI project has. It just has to be built into the timeline from the start, not bolted on at the end.
If you are scoping an AI agent project, the way we scope AI projects is one of the reasons our agents tend to ship and stay shipped. Testing is not a phase we add at the end. It is a parallel track that runs from day one.
How to start testing your agents today
If you are running an agent in production and have not tested it past a demo, here is the smallest version of this process that still helps:
- Write 20 input-output pairs that cover your most common use case. Run them every time the prompt changes.
- Add 5 adversarial prompts that try to make the agent do something out of scope. Run those too.
- Turn on full logging for one week and read every output yourself. You will find at least one issue you did not know existed.
That alone will put you ahead of most teams shipping AI agents today. The full pipeline is more thorough, but the gap between zero testing and minimal testing is the one that matters most.
If you want a deeper audit of an agent already in production, get in touch. We do agent reviews as standalone engagements and the report usually pays for itself within a quarter.
Share this article