← All notes

Building AI Agents That Actually Work

Most AI agents fail in production. Here's the framework I use to design agents that are reliable, predictable, and genuinely useful.

There’s a gap between an AI agent that impresses in a demo and one that ships to production. I’ve built both. The difference usually comes down to a few foundational decisions made early in the design process.

Start with the failure modes

Before writing a single line of code, I map out how the agent can break. Not the happy path — that’s easy. The interesting questions are:

  • What happens when the LLM returns a malformed response?
  • What happens when the tool call fails or times out?
  • What happens when the user’s intent is genuinely ambiguous?

An agent that handles failure gracefully is one you can actually trust. Build the error handling first, then build the capability.

Give the agent a narrow scope

The best agents I’ve built do one thing very well. The worst ones try to be general-purpose assistants.

A good rule of thumb: if you can’t summarize what the agent does in a single sentence, the scope is too broad. Narrow scope means fewer failure modes, easier testing, and — counterintuitively — a better user experience. Users trust a focused tool more than an agent that tries to do everything.

Design the tool interface carefully

The tools you give an agent are its interface with the world. Poorly designed tools lead to poor agent behavior, even with a capable underlying model.

Good tool design means:

  • Clear, specific function names and descriptions
  • Input schemas that are hard to misuse
  • Return values that carry enough context to be useful
  • Errors that are descriptive, not just status codes

When the model reads your tool definitions, it should understand exactly what each tool does and when to use it.

Treat prompts like code

Your system prompt is the most important piece of code in your agent. It deserves the same review, version control, and testing discipline as any other critical component.

I keep system prompts in version control alongside the code, write tests that exercise edge cases in the prompt, and treat prompt changes as deployments — they ship with the same care as a code change.

Test with adversarial inputs

Production users will ask things you didn’t anticipate. Build a test suite that includes:

  • Inputs that are technically valid but semantically strange
  • Inputs designed to trigger unexpected tool calls
  • Inputs that push the agent toward its failure modes

The goal isn’t 100% coverage. It’s building intuition for where the cracks are before a user finds them.


Building AI agents is still a young discipline. The frameworks are evolving, the best practices are being discovered in real time, and most of what you’ll learn comes from shipping things and watching them break. The fundamentals above have served me well across a dozen different agent projects — I hope they’re useful to you too.