Behind the Scenes · May 13, 2026 · 7 min read

Inside How We Monitor AI Agents in Production

Our four-layer AI agent monitoring stack for catching drift, hallucinations, and tool errors before customers do. Here's exactly how we keep production agents honest.


Roughly 6 in 10 production AI agents we audit have silently degraded since launch. The agent that nailed every test in week one is, by month three, calling the wrong tool 4% of the time, citing data that doesn't exist, or quietly skipping steps when input formats shift. The team running it usually has no idea. AI agent monitoring is the difference between a system that earns trust and a system that loses it the moment something changes. Here is exactly how we monitor every agent we ship.

The reason most AI agents quietly fail

Traditional software fails loudly. A null reference throws. A 500 error fires. Logs scream. AI agents fail quietly. They keep responding. They keep calling tools. They keep producing output that looks confident. The failure mode is plausible wrongness, not visible breakage.

That makes ordinary observability stacks insufficient. Datadog will tell you the agent is up. Sentry will tell you no exception was thrown. Neither will tell you the agent has started hallucinating customer IDs or that its CRM update success rate has dropped from 98% to 81% over the last three weeks.

The only way to catch quiet failure is to measure quality, not just uptime.

What we actually monitor

Before we ship an agent, we define a set of measurable signals that map to whether the agent is doing its job. Every agent we build through our automate practice gets the same baseline.

Tool call accuracy

For agents that take actions in external systems, the most important metric is whether the right tool fires with the right arguments. We log every tool call, its inputs, its outputs, and whether the downstream system accepted the result. An agent that should be updating a Salesforce opportunity but is instead creating duplicate contacts is the kind of failure that costs real money and never throws an error.
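As a concrete illustration, a tool-call record can be as small as the sketch below. The field names and the log_tool_call helper are hypothetical, not our production schema; the point is that every call carries its arguments, its result, and whether the downstream system accepted it.

```python
# Hypothetical tool-call record and logger; field names are illustrative only.
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json

@dataclass
class ToolCallEvent:
    agent_id: str
    task_id: str
    tool_name: str      # which tool the agent chose
    arguments: dict     # the exact arguments it passed
    result: dict        # what the downstream system returned
    accepted: bool      # did the downstream system accept the write?
    timestamp: str

def log_tool_call(event: ToolCallEvent, sink) -> None:
    """Append one structured tool-call event to an observability sink (JSONL here)."""
    sink.write(json.dumps(asdict(event)) + "\n")

# Example: record a CRM update and whether it was accepted downstream.
with open("tool_calls.jsonl", "a") as sink:
    log_tool_call(ToolCallEvent(
        agent_id="crm-agent",
        task_id="task-4821",
        tool_name="update_opportunity",
        arguments={"opportunity_id": "006XYZ", "stage": "Closed Won"},
        result={"status": "ok"},
        accepted=True,
        timestamp=datetime.now(timezone.utc).isoformat(),
    ), sink)
```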

Output grounding

For agents that retrieve information and respond in natural language, we measure how often the output cites a real source from the retrieved context versus inventing one. Grounding failures are the cleanest definition of hallucination, and they are measurable when you log the retrieved context alongside the response.
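A rough sketch of that measurement, assuming responses cite sources with an inline marker like [doc:123]. The marker format and field names are ours for illustration; in practice this check often needs fuzzy matching or an LLM judge rather than a regex.

```python
import re

def grounding_rate(runs: list[dict]) -> float:
    """Fraction of runs whose citations all point at documents that were actually retrieved."""
    grounded = 0
    for run in runs:
        retrieved_ids = {doc["id"] for doc in run["retrieved_context"]}
        cited_ids = set(re.findall(r"\[doc:([\w-]+)\]", run["response"]))
        # A run counts as grounded only if it cites something and every
        # citation resolves to a retrieved document.
        if cited_ids and cited_ids <= retrieved_ids:
            grounded += 1
    return grounded / len(runs) if runs else 0.0
```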

Latency and cost per task

Models drift. Routing rules change. Prompts get longer over time as teams add edge cases. A task that used to cost 2 cents and finish in 4 seconds can quietly become a task that costs 11 cents and takes 18 seconds without anyone noticing. We track both metrics at the task level, not just the request level, so we can spot when a single workflow starts bloating.
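A minimal sketch of the task-level rollup, assuming each logged request event carries a task_id plus its own cost and latency. The field names are assumptions, not our schema.

```python
from collections import defaultdict

def per_task_metrics(request_events: list[dict]) -> dict[str, dict]:
    """Roll request-level events up to total cost and latency per task,
    so a single bloating workflow stands out in the trend."""
    tasks: dict[str, dict] = defaultdict(lambda: {"cost_usd": 0.0, "latency_s": 0.0})
    for event in request_events:
        task = tasks[event["task_id"]]
        task["cost_usd"] += event["cost_usd"]
        task["latency_s"] += event["latency_s"]
    return dict(tasks)
```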

User feedback signals

The cheapest way to detect quality regression is to ask the people using the agent. A thumbs up or down on every meaningful interaction, plus an optional one-line reason, gives us a continuous signal that complements automated evals. We treat any sustained drop in thumbs-up rate as a P1.
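As an illustration only, the week-over-week check can be this small; the 5-point drop threshold below is a placeholder, not our actual P1 rule.

```python
def thumbs_up_rate(feedback: list[dict]) -> float:
    """Share of rated interactions with a thumbs up; feedback dicts carry a 'rating' key."""
    rated = [f for f in feedback if f.get("rating") in ("up", "down")]
    if not rated:
        return 1.0
    return sum(f["rating"] == "up" for f in rated) / len(rated)

def sustained_drop(this_week: float, last_week: float, threshold: float = 0.05) -> bool:
    # Placeholder rule: flag a week-over-week drop of 5 points or more.
    return (last_week - this_week) >= threshold
```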

The four-layer monitoring stack we deploy

Every production agent we ship sits behind these four layers. The stack is intentionally boring. The hard work is in the test sets, not the infrastructure.

Layer 1: Structured event logging

Every agent action gets logged as a structured event: the input, the model used, the tools called, the retrieved context, the final output, and any user feedback. We pipe these into a dedicated observability table, not a generic application log. The point is to make every agent run replayable. If something looks wrong in week eight, we can pull the exact run, replay it locally, and see what happened.
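A minimal sketch of what one run record can look like, with an illustrative replay hook. The fields mirror the list above, but the schema and the agent_fn signature are assumptions for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class AgentRunEvent:
    run_id: str
    input: str                     # what the agent was asked to do
    model: str                     # exact model and version used
    tool_calls: list[dict]         # every tool call with arguments and results
    retrieved_context: list[dict]  # what retrieval returned for this run
    output: str
    user_feedback: Optional[str]   # "up", "down", or None

def replay(run: AgentRunEvent, agent_fn: Callable[..., str]) -> str:
    """Re-run the agent on the recorded input with the recorded context and model,
    so a week-eight failure can be reproduced locally."""
    return agent_fn(run.input, context=run.retrieved_context, model=run.model)
```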

Layer 2: Automated evals on a golden test set

For every agent, we build a golden test set of 80 to 300 cases that cover the most common workflows, the gnarliest edge cases, and the failure modes we caught during build. The evals run on every deploy and once a day in production against the live agent. If accuracy on the golden set drops more than 3% week over week, we get paged.
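In rough outline, an eval run looks like the sketch below. The case format and the paging hook are assumptions; the 3-point week-over-week threshold comes from the rule above.

```python
def golden_set_accuracy(agent_fn, cases: list[dict]) -> float:
    """Each case pairs an input with a check function that scores the agent's output."""
    passed = sum(bool(case["check"](agent_fn(case["input"]))) for case in cases)
    return passed / len(cases)

def check_regression(current: float, last_week: float, page) -> None:
    # Page when golden-set accuracy drops more than 3 points week over week.
    if last_week - current > 0.03:
        page(f"Golden-set accuracy fell from {last_week:.1%} to {current:.1%}")
```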

Building a strong test set is the most underrated work in AI engineering. We typically spend 20% of project time on it, and it is where most teams cut corners.

Layer 3: Anomaly detection and alerting

For metrics that drift gradually, threshold alerts are noisy and slow. We use simple statistical baselines per metric (tool call success rate, average latency, thumbs-down rate, cost per task) and alert on multi-sigma deviations. A 4% drop in tool call success that happens over six weeks would never trip a static threshold. It absolutely trips an anomaly detector.
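A simplified version of that per-metric check, assuming the baseline is computed over a frozen reference window (for example, the weeks right after launch) rather than a trailing one, so slow drift still registers. The window length and 3-sigma cut-off are illustrative defaults.

```python
import statistics

def is_anomalous(reference: list[float], current: float, sigmas: float = 3.0) -> bool:
    """Compare today's value of a metric against a frozen reference window,
    so gradual drift away from the launch baseline still trips the alert."""
    if len(reference) < 14:  # not enough reference data to trust the baseline
        return False
    mean = statistics.fmean(reference)
    stdev = statistics.pstdev(reference)
    if stdev == 0:
        return current != mean
    return abs(current - mean) > sigmas * stdev
```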

Layer 4: Human-in-the-loop review cadence

Once a week, a member of our team reviews a random sample of 20 to 50 agent runs across our active client deployments. We rate each one on accuracy, helpfulness, and risk. Patterns surface quickly: a new edge case, a tool returning slightly different JSON, a prompt that has aged poorly. The review takes about 45 minutes per client and has caught more real issues than any automated alert.

How we respond when something drifts

Detection is half the job. The other half is fixing things without breaking the agent further. Our response playbook is consistent across every client.

  • Triage within 24 hours. Every anomaly gets classified as cosmetic, functional, or risk-bearing. Risk-bearing issues go to active fix immediately.
  • Reproduce before you change. We replay the exact failing runs locally with the same context and model version before we touch the prompt or tools. Roughly 1 in 5 alerts we investigate turns out to be a downstream system issue, not an agent issue.
  • Patch via eval, not vibes. Every fix has to pass the existing golden set plus a new test case that captures the failure mode. No silent regressions.
  • Roll out behind a shadow. For meaningful prompt or model changes, we run the new version against live traffic in shadow mode for 24 to 72 hours and diff its outputs against the live agent before promoting (a rough sketch follows this list).

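A rough sketch of the shadow comparison under simplifying assumptions: the candidate version runs on the same live inputs, and an agree() comparator (exact match, a rubric, or an LLM judge in practice) scores how often it matches the live agent before it gets promoted.

```python
def shadow_agreement(live_agent, shadow_agent, inputs: list[str], agree) -> float:
    """Fraction of live traffic where the shadow version's output matches the live agent's."""
    if not inputs:
        return 1.0
    matches = sum(bool(agree(live_agent(x), shadow_agent(x))) for x in inputs)
    return matches / len(inputs)
```
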
This is the same discipline we use when we scope and build new AI projects, pulled forward into the day-to-day of running them.

What this actually gives clients

Monitoring is not a feature. It is what makes an AI agent a system you can rely on for years instead of a demo that ages out in six months.

  • Drift caught in days, not quarters. A 5% degradation that would normally take a customer complaint to surface gets caught the same week it starts.
  • Honest performance metrics. Clients get a monthly report with real numbers on accuracy, latency, cost, and user feedback. No vibes, no anecdotes.
  • Fewer incidents that touch customers. Across our deployments, customer-facing AI incidents dropped roughly 70% in the six months after we standardized this stack.
  • A path to improving the agent over time. The logs and feedback become the training data for the next prompt, the next tool, the next version of the agent.

Build agents that don't quietly break

Most teams ship an AI agent the same way they ship a feature. They test it before launch, click around, get a green light, and move on. Three months later, performance has slipped 20% and nobody notices until a customer escalates.

The teams that win with AI treat agents like products, not projects. They invest in evals before they invest in features. They watch quality the same way they watch revenue. An AI agent without monitoring is a liability with a friendly UI.

If you have agents running in production and no clear answer to "how do you know they are still working," that is the most important problem to solve this quarter. Ours is one path. Whatever stack you pick, pick one.

Want to talk through what to measure on your own agents? Book a 30-minute call and we will walk through your setup with you.
