Behind the Scenes | Jun 1, 2026 | 7 min read

How We Engineer Prompts for Production AI Agents

Inside AXI's prompt engineering process for production AI agents: the structure, the guardrails, and the testing loop that keeps outputs reliable at scale.

Prompt Engineering

A single badly written prompt can cost a client more than a bad model choice. We have watched a well-funded AI pilot stall for three months because the team kept tuning the model when the real problem was a 4,000-word prompt that contradicted itself in two places. Prompt engineering is the least glamorous part of building production AI agents and the part that decides whether the agent ships. After more than 1,000 projects, we treat prompts the way we treat code: versioned, reviewed, tested, and monitored. Here is exactly how we do it.

Why Prompt Engineering Is Not Just Writing Instructions

The word "prompt" makes people picture a clever sentence typed into a chat box. Production prompt engineering is closer to writing a spec for a junior employee who never asks questions and never remembers yesterday. Every assumption has to be on the page. Every edge case has to be named. Every output format has to be defined precisely enough that downstream code can parse it without guessing.

The gap between a demo prompt and a production prompt is enormous. A demo prompt works once, in front of a friendly audience, on a clean input. A production prompt has to work ten thousand times, against messy real-world data, with a measurable failure rate the business can live with. A demo prompt is judged by its best output. A production prompt is judged by its worst.

That shift in standard changes everything about how we write them. We are not optimizing for the most impressive answer. We are optimizing to narrow the gap between the best answer and the worst one.

The Structure We Use for Every Production Prompt

We do not start from a blank page. Every production prompt at AXI follows the same skeleton, which keeps prompts readable, reviewable, and easy to debug when something breaks at 2 a.m.

  • Role and scope. One or two sentences defining what the agent is and, just as important, what it is not allowed to do. Scope boundaries prevent more failures than any other section.
  • Inputs. An explicit description of every variable the prompt receives, what format it arrives in, and what to do when one is missing or malformed.
  • Task steps. The actual instructions, written as an ordered procedure rather than a paragraph. Ordered steps are easier to test and easier to edit without breaking neighboring logic.
  • Output contract. The exact schema the agent must return, usually JSON, with a worked example. Downstream systems depend on this, so it is non-negotiable.
  • Guardrails. What to refuse, what to escalate to a human, and how to behave under uncertainty.

This structure is boring on purpose. Boring is maintainable. When a prompt is laid out the same way every time, any engineer on the team can open it, find the section that is misbehaving, and fix it without re-reading the whole thing.
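As a rough illustration of that skeleton, here is a minimal Python sketch that renders the five sections in a fixed order. The class name `PromptSpec`, the section headings, and the field names are our own assumptions for this example, not a fixed AXI format.

```python
from dataclasses import dataclass


@dataclass
class PromptSpec:
    """Hypothetical container for the five sections of a production prompt."""
    role_and_scope: str
    inputs: dict[str, str]   # variable name -> format and what to do if missing
    task_steps: list[str]    # ordered procedure, one instruction per entry
    output_contract: str     # schema description plus a worked example
    guardrails: list[str]    # refusals, escalations, behavior under uncertainty

    def render(self) -> str:
        """Render the sections in the same order with the same headings every time."""
        input_lines = "\n".join(f"- {name}: {desc}" for name, desc in self.inputs.items())
        step_lines = "\n".join(f"{i}. {step}" for i, step in enumerate(self.task_steps, 1))
        guard_lines = "\n".join(f"- {rule}" for rule in self.guardrails)
        sections = [
            ("ROLE AND SCOPE", self.role_and_scope),
            ("INPUTS", input_lines),
            ("TASK STEPS", step_lines),
            ("OUTPUT CONTRACT", self.output_contract),
            ("GUARDRAILS", guard_lines),
        ]
        return "\n\n".join(f"## {title}\n{body}" for title, body in sections)
```

Because every prompt is rendered from the same five fields, a reviewer can diff one section at a time instead of re-reading the whole prompt.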

The Output Contract Is the Most Important Section

If we had to keep only one section, it would be the output contract. An AI agent that returns the right answer in the wrong format is, from a software perspective, returning the wrong answer. The downstream parser breaks, the workflow halts, and the client sees an error.

We define output as strict JSON with named fields, types, and a fully worked example inside the prompt. We also tell the model what to do when it is unsure: return a confidence field and a reason rather than inventing a value. This single habit removes a large share of the silent failures that plague AI projects. It is also central to how we test AI agents before shipping them, because a stable output contract is what makes automated testing possible at all.
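To make the idea concrete, here is a small Python sketch of a parser that enforces such a contract on the consuming side. The field names (`status`, `value`, `confidence`, `reason`) and the 0 to 1 confidence range are assumptions chosen for this example. A real contract would use whatever schema the workflow needs.

```python
import json

# Hypothetical contract: field name -> accepted Python types after JSON parsing.
CONTRACT = {
    "status": (str,),
    "value": (str, type(None)),   # null is allowed when the model is unsure
    "reason": (str,),
}


def parse_agent_output(raw: str) -> dict:
    """Parse model output and raise ValueError if it breaks the contract."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("top-level value must be a JSON object")
    for name, types in CONTRACT.items():
        if name not in data:
            raise ValueError(f"missing field: {name}")
        if not isinstance(data[name], types):
            raise ValueError(f"field {name!r} has the wrong type")
    confidence = data.get("confidence")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0 <= confidence <= 1:
        raise ValueError("confidence must be between 0 and 1")
    return data
```

A failed parse raises immediately with a named reason, so a format error surfaces as a clear, loggable failure rather than a silent one further down the workflow.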

How We Stop Prompts From Drifting Into Chaos

The most common reason a prompt fails in production is not that it was wrong on day one. It is that it grew. Someone hits an edge case, adds a sentence to handle it, hits another, adds another sentence, and three months later the prompt is a 4,000-word swamp of contradictory instructions that no one fully understands.

We fight this with three rules.

Rule one: every prompt has an owner and a version. Prompts live in the repository next to the code, not in a Google Doc or a Slack thread. Changes go through review like any other change. A prompt edit that ships untested is treated as a bug waiting to happen.
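One possible way to make ownership and versioning visible at runtime is sketched below in Python. The `VersionedPrompt` class, its fields, and the tag format are illustrative assumptions. The only point is that every logged output can be traced to an exact prompt text.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class VersionedPrompt:
    """Hypothetical record for a prompt stored in the repository."""
    name: str
    owner: str
    version: str
    text: str

    @property
    def fingerprint(self) -> str:
        """Short content hash, so an unreviewed edit changes the tag."""
        return hashlib.sha256(self.text.encode("utf-8")).hexdigest()[:12]

    def tag(self) -> str:
        """Identifier to attach to every logged model call."""
        return f"{self.name}@{self.version}+{self.fingerprint}"
```

If someone edits the text without bumping the version, the fingerprint still changes, which makes the untested edit easy to spot in logs.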

Rule two: additions require subtractions. When we add a new instruction to handle an edge case, we check whether an existing instruction already covers it or conflicts with it. Prompts get pruned as often as they get extended. A shorter prompt that handles 95% of cases cleanly beats a sprawling one that handles 99% unreliably.

Rule three: instructions are concrete, not aspirational. "Be accurate" is not an instruction. "If the invoice total does not match the line-item sum, return needs_review with the discrepancy amount" is. Vague encouragement does nothing. Specific, testable rules do the work.
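One test of whether an instruction is concrete is whether it can be restated as a deterministic check. The Python sketch below does that for the invoice rule above. The function name, the string-based amounts, and the `ok` status are assumptions for illustration. Only `needs_review` and the discrepancy amount come from the rule itself.

```python
from decimal import Decimal


def check_invoice_total(total: str, line_items: list[str]) -> dict:
    """Return needs_review with the discrepancy when total != sum of line items."""
    # Decimal avoids the float rounding errors that make money comparisons flaky.
    line_item_sum = sum((Decimal(item) for item in line_items), Decimal("0"))
    discrepancy = Decimal(total) - line_item_sum
    if discrepancy != 0:
        return {"status": "needs_review", "discrepancy": str(discrepancy)}
    return {"status": "ok", "discrepancy": "0"}
```

A check like this can supply the expected answers for test inputs, which is not possible for an instruction such as "Be accurate".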

The Testing Loop

We never ship a prompt because it looked good in a few manual tries. Every production prompt goes through an evaluation set before it touches a client workflow.

The process is straightforward. We assemble 50 to 200 real input examples, including the ugly ones: malformed data, missing fields, adversarial phrasing, and the long-tail cases that broke earlier versions. We run the prompt against the full set on every change and score the outputs against expected results. A prompt change is only approved when the eval score holds or improves across the entire set, not just the example that motivated the change.
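The gate itself can be quite small. The following Python sketch assumes exact-match scoring and treats each prompt version as a plain function from input text to output text. Both are simplifications, and the function names are hypothetical. A real harness would call the model and might score outputs field by field.

```python
from typing import Callable

Case = tuple[str, str]           # (input, expected output)
PromptFn = Callable[[str], str]  # stands in for "run this prompt version on an input"


def eval_score(run_prompt: PromptFn, cases: list[Case]) -> float:
    """Fraction of cases where the output exactly matches the expected result."""
    if not cases:
        raise ValueError("evaluation set is empty")
    passed = sum(1 for text, expected in cases if run_prompt(text) == expected)
    return passed / len(cases)


def review_change(baseline: PromptFn, candidate: PromptFn, cases: list[Case]) -> dict:
    """Approve only if the candidate's score holds or improves on the full set."""
    regressions = [
        text for text, expected in cases
        if baseline(text) == expected and candidate(text) != expected
    ]
    base, cand = eval_score(baseline, cases), eval_score(candidate, cases)
    return {"approved": cand >= base, "baseline": base,
            "candidate": cand, "regressions": regressions}
```

Listing the regressed inputs alongside the score shows which previously passing cases a change broke, even when the overall score still holds.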

This catches the most dangerous failure mode in prompt engineering: fixing one case while quietly breaking five others. Without a held-out evaluation set, you are editing blind. With one, every change is measurable. The eval set is the difference between prompt engineering as a craft and prompt engineering as guesswork.

We also test across model versions. Models update, and a prompt tuned tightly to one version can degrade when the provider ships a new one. Running the eval set against candidate models before upgrading tells us whether a migration is safe long before a client notices.

What Changes When You Build Agents Instead of Single Prompts

A modern AI agent is rarely one prompt. It is a chain: a router that classifies the request, specialized prompts for each path, and sometimes a verifier prompt that checks the work of the others. This is the heart of our AI automation work, and it changes the engineering discipline.

The key principle is that each prompt in the chain should do one job well. A single mega-prompt trying to classify, extract, reason, and format all at once is impossible to debug and brittle under load. Splitting the work into focused steps means each step has its own output contract, its own eval set, and its own clear failure signal. When the chain breaks, we know exactly which link failed.

The trade-off is latency and cost, since more steps mean more model calls. We manage that by using smaller, faster models for simple routing and classification steps, and reserving the expensive reasoning models for the steps that actually need them. Matching model power to task difficulty is one of the highest-leverage decisions in the whole build.
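A stripped-down version of such a chain might look like the Python sketch below. The step names, the model names (`small-fast-model`, `large-reasoning-model`), the two routes, and the `call_model` callable are all placeholders invented for this example. No real provider API is implied.

```python
from dataclasses import dataclass
from typing import Callable

# Placeholder for a provider call: (model_name, prompt) -> model text.
ModelCall = Callable[[str, str], str]


@dataclass(frozen=True)
class Step:
    name: str
    model: str      # model tier matched to the difficulty of the step
    template: str


ROUTER = Step("router", "small-fast-model",
              "Classify the request as 'invoice' or 'support'. Request: {request}")
PATHS = {
    "invoice": Step("invoice_extractor", "large-reasoning-model",
                    "Extract the invoice fields as JSON. Request: {request}"),
    "support": Step("support_reply", "small-fast-model",
                    "Draft a support reply as JSON. Request: {request}"),
}


def run_chain(request: str, call_model: ModelCall) -> dict:
    """Route with a cheap model, then run the one specialized step for that path."""
    label = call_model(ROUTER.model, ROUTER.template.format(request=request)).strip().lower()
    if label not in PATHS:
        # The failure is attributed to a specific link in the chain.
        return {"failed_step": ROUTER.name, "detail": f"unknown route: {label!r}"}
    step = PATHS[label]
    output = call_model(step.model, step.template.format(request=request))
    return {"failed_step": None, "route": label, "model": step.model, "output": output}
```

Passing `call_model` in as an argument keeps the chain testable with a fake model, and each step's output can still be checked against its own contract before the result is used.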

The Takeaways

If your AI project is stalling and you keep blaming the model, look at the prompts first. In our experience the prompt layer is the cause far more often than the model.

The practical lessons after 1,000-plus projects come down to a few habits. Treat prompts as versioned, reviewed code rather than disposable text. Make the output contract strict and explicit so downstream systems can trust it. Prune prompts as aggressively as you extend them. Build a real evaluation set and let it gate every change. And when you move from single prompts to agents, give each step one clear job.

None of this is glamorous. All of it is what separates an AI demo that wows a meeting from an AI agent that quietly does real work every day for years. If you want a team that engineers prompts this way from the start, let's talk about your project.
