
How I Built a Production AI Agent: Architecture, Failures, and What Actually Worked


Last year I shipped an AI agent that cost us $400 in a single afternoon. It got stuck in a loop, calling the same API endpoint over and over, burning tokens on each iteration while producing nothing useful. The monitoring dashboard lit up, but by the time someone noticed, the damage was done.

That failure taught me more about agent architecture than any tutorial or conference talk. This post is the full story — the naive design I started with, the failures that forced redesigns, and the architecture that has been running in production for months without incident.

If you are building AI agents and wondering why your demo works perfectly but your production system keeps breaking, this is for you. For broader context on where agents fit into AI Engineering as a discipline, see my AI Engineering practitioner's guide.


The Context: Why I Needed a Production Agent

The project was a customer support automation system. Not a chatbot that answers FAQs — an actual agent that could look up orders, check shipping status, process returns, escalate to humans, and update CRM records. It needed to handle 200+ concurrent conversations with sub-3-second response times.

The demo I built in two days worked great. It could handle the happy path for any scenario I threw at it. The gap between that demo and production turned out to be about four months of engineering work.


Architecture v1: The Naive Approach

My first architecture was the one you see in every tutorial: a simple while loop that calls the LLM, checks if it wants to use a tool, executes the tool, and feeds the result back.

// v1: The architecture that cost us $400
async function runAgent(userMessage: string): Promise<string> {
  const messages: Message[] = [
    { role: 'system', content: systemPrompt },
    { role: 'user', content: userMessage }
  ];
 
  while (true) {
    const response = await llm.chat({ messages, tools });
 
    if (response.stopReason === 'end_turn') {
      return response.content;
    }
 
    // Record the assistant turn (including its tool calls) before the
    // tool results; chat APIs reject tool results with no matching call
    messages.push({ role: 'assistant', content: response.content, toolCalls: response.toolCalls });

    for (const toolCall of response.toolCalls) {
      const result = await executeTool(toolCall);
      messages.push({ role: 'tool', content: result, toolCallId: toolCall.id });
    }
  }
}

This works in a demo because:

  • Inputs are predictable
  • The LLM always converges on an answer
  • External APIs always respond
  • Nobody is trying to break it
  • Cost does not matter

In production, every one of those assumptions failed within the first week.

What Went Wrong

Infinite loops. A customer asked about an order that had been split into sub-orders. The agent looked up the parent order, saw references to child orders, looked those up, saw references back to the parent, and cycled forever. No iteration limit. No token budget. Just a while (true) that lived up to its name.

Token burn. Each iteration appended the full tool response to the message array. For a customer with a long order history, the context window filled up with JSON data. The agent kept running, but every subsequent call cost more tokens because the input grew linearly with each loop.

Silent failures. When the shipping API returned a 503, the agent received an error message, tried a different approach (which also failed), and then hallucinated a shipping status. The customer got a confident, wrong answer. No alert fired because the agent technically "completed" its task.

Tool confusion. I had 15 tools registered. The agent sometimes called update_order when it should have called get_order_status. In a demo, you catch this because you are watching. In production at 200 concurrent conversations, you do not.


Architecture v2: Planner, Executor, Observer

After the $400 incident, I rebuilt the architecture around three distinct stages in each loop iteration.

interface AgentState {
  messages: Message[];
  plan: string | null;
  iterationCount: number;
  tokenUsage: { input: number; output: number };
  toolCallHistory: ToolCallRecord[];
}
 
interface AgentConfig {
  maxIterations: number;       // hard stop at 25
  maxTokenBudget: number;      // total tokens before forced completion
  toolTimeout: number;         // per-tool timeout in ms
  repetitionThreshold: number; // same tool call N times = circuit break
}
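For concreteness, here is a config instance consistent with the thresholds in the comments above. The 10-second tool timeout is an assumption on my part; tune it per tool and workload.

```typescript
interface AgentConfig {
  maxIterations: number;
  maxTokenBudget: number;
  toolTimeout: number;
  repetitionThreshold: number;
}

const defaultConfig: AgentConfig = {
  maxIterations: 25,       // hard stop on the loop
  maxTokenBudget: 50_000,  // forced completion past this
  toolTimeout: 10_000,     // 10s per tool call (illustrative)
  repetitionThreshold: 3,  // identical calls before circuit break
};
```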

The loop now has three phases:

Phase 1: Plan

Before executing anything, the agent creates or updates a plan. This uses a lightweight model call with a focused prompt.

async function planStep(state: AgentState, config: AgentConfig): Promise<Plan> {
  const planResponse = await llm.chat({
    model: 'claude-haiku-4-20250514', // cheap model for planning
    messages: [
      { role: 'system', content: plannerPrompt },
      { role: 'user', content: JSON.stringify({
        originalRequest: state.messages[1].content, // [0] is the system prompt; [1] is the first user message
        currentPlan: state.plan,
        completedActions: state.toolCallHistory.map(summarizeToolCall),
        iteration: state.iterationCount,
        remainingBudget: config.maxTokenBudget - totalTokens(state.tokenUsage)
      })}
    ],
    maxTokens: 300
  });
 
  return parsePlan(planResponse.content);
}

The planner outputs a structured plan: what to do next, what tools are needed, and whether the task is complete. Using a smaller model here saves significant cost — planning does not require the same reasoning depth as execution.
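The `Plan` shape and `parsePlan` are not shown above. A minimal sketch of both follows; the exact fields are my assumption, but the post only requires that a plan carry a next step, the tools it needs, and a completion flag. The parser is deliberately defensive: a malformed planner response must not crash the loop.

```typescript
// Minimal plan shape the planner is prompted to emit as JSON
interface Plan {
  nextStep: string;
  toolsNeeded: string[];
  isComplete: boolean;
}

function parsePlan(raw: string): Plan {
  try {
    const parsed = JSON.parse(raw);
    return {
      nextStep: typeof parsed.nextStep === 'string' ? parsed.nextStep : '',
      toolsNeeded: Array.isArray(parsed.toolsNeeded) ? parsed.toolsNeeded : [],
      isComplete: parsed.isComplete === true,
    };
  } catch {
    // Malformed planner output: treat as "not done" and let the
    // observer's hard limits bound the loop
    return { nextStep: '', toolsNeeded: [], isComplete: false };
  }
}
```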

Phase 2: Execute

The executor takes the plan and runs the next step. This is where the expensive model call happens, but it is now scoped to a single action rather than an open-ended conversation.

async function executeStep(
  state: AgentState,
  plan: Plan,
  config: AgentConfig
): Promise<ExecutionResult> {
  // Select tools relevant to current plan step
  const relevantTools = selectToolsForStep(plan.nextStep, allTools);
 
  const response = await llm.chat({
    model: 'claude-sonnet-4-6-20250514', // capable model for execution
    messages: state.messages,
    tools: relevantTools, // only 3-5 tools, not all 15
    maxTokens: 1024
  });
 
  // Execute tool calls with timeout and error handling
  const toolResults = await Promise.allSettled(
    response.toolCalls.map(tc =>
      executeToolWithGuardrails(tc, config.toolTimeout)
    )
  );
 
  return { response, toolResults };
}

The critical change: tool filtering. Instead of giving the agent all 15 tools every time, I give it 3-5 tools relevant to the current plan step. This cut tool selection errors by 80%.

Phase 3: Observe

The observer decides what happens next: continue, complete, or escalate.

async function observeStep(
  state: AgentState,
  executionResult: ExecutionResult,
  config: AgentConfig
): Promise<'continue' | 'complete' | 'escalate'> {
  // Check hard limits
  if (state.iterationCount >= config.maxIterations) return 'escalate';
  if (totalTokens(state.tokenUsage) >= config.maxTokenBudget) return 'escalate';
 
  // Check for repetitive behavior
  if (detectRepetition(state.toolCallHistory, config.repetitionThreshold)) {
    return 'escalate';
  }
 
  // Check for tool failures. executeToolWithGuardrails catches errors and
  // returns { status: 'error' } rather than rejecting, so inspect the
  // fulfilled results as well as any rejected promises.
  const failedTools = executionResult.toolResults.filter(
    r => r.status === 'rejected' ||
      (r.status === 'fulfilled' && r.value.status === 'error')
  );
  if (failedTools.length > 0 && state.iterationCount > 3) {
    return 'escalate';
  }
 
  // Check if the plan says we are done
  if (executionResult.response.stopReason === 'end_turn') {
    return 'complete';
  }
 
  return 'continue';
}

Key Failures and What They Taught Me

Failure 1: Infinite Loops and Token Burn

What happened: Agent stuck in circular tool calls. Each iteration cost more tokens as context grew.

Root cause: No iteration limit, no token tracking, no repetition detection.

Fix: Three-layer protection.

| Layer | Mechanism | Threshold |
| --- | --- | --- |
| Iteration cap | Hard loop limit | 25 iterations max |
| Token budget | Running total of input + output tokens | 50K tokens per conversation |
| Repetition detector | Track tool name + argument hashes | 3 identical calls = circuit break |

function detectRepetition(
  history: ToolCallRecord[],
  threshold: number
): boolean {
  const recent = history.slice(-threshold * 2);
  const callSignatures = recent.map(
    r => `${r.toolName}:${hashArguments(r.arguments)}`
  );
 
  const counts = new Map<string, number>();
  for (const sig of callSignatures) {
    counts.set(sig, (counts.get(sig) ?? 0) + 1);
    if (counts.get(sig)! >= threshold) return true;
  }
  return false;
}
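`detectRepetition` relies on a `hashArguments` helper that is not shown. A minimal sketch follows; the one requirement is that the signature be stable under key order, so `{a: 1, b: 2}` and `{b: 2, a: 1}` count as the same call. A real implementation might additionally hash the string to keep signatures short.

```typescript
// Recursively sort object keys so the serialization is canonical
function canonicalize(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(canonicalize);
  if (value !== null && typeof value === 'object') {
    return Object.keys(value as object)
      .sort()
      .reduce((acc, k) => {
        acc[k] = canonicalize((value as Record<string, unknown>)[k]);
        return acc;
      }, {} as Record<string, unknown>);
  }
  return value;
}

function hashArguments(args: Record<string, unknown>): string {
  return JSON.stringify(canonicalize(args));
}
```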

Failure 2: Tool Selection Confusion

What happened: With 15 tools available, the agent picked wrong tools 12% of the time. update_order instead of get_order_status. search_customer when it already had the customer ID.

Root cause: Too many tools in context creates decision fatigue for the model. Tool descriptions were ambiguous.

Fix: Dynamic tool filtering based on the current plan step, plus improved tool descriptions with explicit "when to use" and "when NOT to use" fields.

interface ProductionTool<TInput, TOutput> {
  name: string;
  description: string;
  whenToUse: string;    // "Use when you need the current status of a specific order"
  whenNotToUse: string; // "Do NOT use for updating orders or looking up customer info"
  inputSchema: z.ZodSchema<TInput>;
  execute: (input: TInput) => Promise<TOutput>;
  category: 'lookup' | 'mutation' | 'search' | 'escalation';
}
 
function selectToolsForStep(
  step: PlanStep,
  allTools: ProductionTool<any, any>[]
): ProductionTool<any, any>[] {
  // Map plan step types to tool categories
  const categoryMap: Record<string, string[]> = {
    'gather_info': ['lookup', 'search'],
    'take_action': ['mutation'],
    'escalate': ['escalation'],
  };
 
  const allowedCategories = categoryMap[step.type] ?? ['lookup', 'search'];
  return allTools.filter(t => allowedCategories.includes(t.category));
}

After implementing tool filtering, misselection dropped from 12% to under 2%. I wrote about tool calling patterns in more depth in my post on tool calling patterns for reliable agents.

Failure 3: Prompt Drift Over Long Conversations

What happened: In conversations that went past 10 messages, the agent started ignoring its system prompt instructions. It became overly chatty, stopped following output format requirements, and occasionally broke character.

Root cause: Long context means the system prompt has less influence on the model's behavior. The most recent messages dominate.

Fix: Instruction reinforcement. I inject a condensed version of critical instructions at regular intervals in the conversation.

function reinforceInstructions(messages: Message[]): Message[] {
  const reinforcement: Message = {
    role: 'user',
    content: `[SYSTEM REMINDER] You are a customer support agent. Follow these rules:
1. Always verify order details before taking action
2. Never fabricate shipping status — if lookup fails, say so
3. Respond in under 100 words unless the customer asks for details
4. If unsure, escalate to human support`
  };
 
  // Strip previously injected reminders first so calling this on every
  // loop iteration stays idempotent instead of stacking duplicates
  const result = messages.filter(m => m.content !== reinforcement.content);
  // Insert a reminder after every 8 messages (step is 9 because each
  // splice adds one element to the array)
  for (let i = 8; i < result.length; i += 9) {
    result.splice(i, 0, reinforcement);
  }
  return result;
}

This is not elegant, but it works. Prompt drift cost us customer trust before we caught it — the agent was giving verbose, unfocused answers that made it obvious it was an AI struggling rather than a capable system.

Failure 4: External API Failures Mid-Loop

What happened: The shipping provider API went down for 20 minutes. During that time, the agent received error responses, interpreted them as "no shipping information found," and told customers their orders had not shipped — when they actually had.

Root cause: The agent treated API errors as valid data. No distinction between "information not found" and "could not retrieve information."

Fix: Structured error handling that separates tool execution failures from legitimate empty results.

type ToolResult<T> =
  | { status: 'success'; data: T }
  | { status: 'not_found'; message: string }
  | { status: 'error'; error: string; retryable: boolean };
 
async function executeToolWithGuardrails<T>(
  toolCall: ToolCall,
  timeout: number
): Promise<ToolResult<T>> {
  try {
    const result = await Promise.race([
      registeredTools[toolCall.name].execute(toolCall.arguments),
      new Promise<never>((_, reject) =>
        setTimeout(() => reject(new Error('Tool timeout')), timeout)
      )
    ]);
 
    if (result === null || result === undefined) {
      return { status: 'not_found', message: `No data found for given parameters` };
    }
    return { status: 'success', data: result };
  } catch (error) {
    return {
      status: 'error',
      error: `Tool execution failed: ${error.message}`,
      retryable: isRetryableError(error)
    };
  }
}

The agent's system prompt explicitly instructs it: "If a tool returns an error status, tell the customer you are unable to retrieve the information right now. Never guess or infer data that a tool failed to provide."
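For the model to obey that instruction, the status has to be visible in the transcript, not buried in raw error text. One way to serialize a `ToolResult` into the tool message; the exact markers are my assumption, the point is that `error` and `not_found` render as distinct strings the system prompt can reference.

```typescript
type ToolResult<T> =
  | { status: 'success'; data: T }
  | { status: 'not_found'; message: string }
  | { status: 'error'; error: string; retryable: boolean };

// Render a ToolResult as the string placed in the tool message
function renderToolResult<T>(result: ToolResult<T>): string {
  switch (result.status) {
    case 'success':
      return JSON.stringify(result.data);
    case 'not_found':
      return `[NOT_FOUND] ${result.message}`;
    case 'error':
      return `[TOOL_ERROR] ${result.error}` +
        (result.retryable ? ' (retryable)' : ' (do not retry)');
  }
}
```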


The Architecture That Survived Production

Here is the full agent loop as it runs today:

async function runProductionAgent(
  userMessage: string,
  conversationHistory: Message[],
  config: AgentConfig
): Promise<AgentResponse> {
  const state: AgentState = {
    messages: [
      { role: 'system', content: systemPrompt },
      ...conversationHistory,
      { role: 'user', content: userMessage }
    ],
    plan: null,
    iterationCount: 0,
    tokenUsage: { input: 0, output: 0 },
    toolCallHistory: []
  };
 
  const traceId = generateTraceId();
  logAgentStart(traceId, userMessage);
 
  while (state.iterationCount < config.maxIterations) {
    state.iterationCount++;
 
    // Phase 1: Plan (cheap model)
    const plan = await planStep(state, config);
    logPlanStep(traceId, plan);
 
    if (plan.isComplete) {
      return finalizeResponse(state, traceId);
    }
 
    // Phase 2: Execute (capable model, filtered tools)
    const result = await executeStep(state, plan, config);
    updateTokenUsage(state, result);
    recordToolCalls(state, result);
    logExecutionStep(traceId, result);
 
    // Phase 3: Observe (decide next action)
    const decision = await observeStep(state, result, config);
    logObservation(traceId, decision);
 
    if (decision === 'complete') {
      return finalizeResponse(state, traceId);
    }
    if (decision === 'escalate') {
      return escalateToHuman(state, traceId);
    }
 
    // Reinforce instructions if conversation is long
    if (state.messages.length > 8) {
      state.messages = reinforceInstructions(state.messages);
    }
  }
 
  // Should not reach here, but safety net
  return escalateToHuman(state, traceId);
}

Guardrails: Input, Output, and Circuit Breakers

Guardrails are not optional. Every production agent needs three types:

Input Validation

Validate user input before it reaches the agent. This catches prompt injection, malformed requests, and out-of-scope queries.

async function validateInput(input: string): Promise<ValidationResult> {
  // Length check
  if (input.length > 5000) {
    return { valid: false, reason: 'Input exceeds maximum length' };
  }
 
  // Injection detection (keyword-based + classifier)
  if (detectInjectionAttempt(input)) {
    return { valid: false, reason: 'Potential prompt injection detected' };
  }
 
  // Scope check — is this something our agent should handle?
  const scopeCheck = await llm.chat({
    model: 'claude-haiku-4-20250514',
    messages: [{
      role: 'user',
      content: `Is this a customer support question about orders, shipping, or returns? Answer YES or NO.\n\nQuery: ${input}`
    }],
    maxTokens: 10
  });
 
  if (scopeCheck.content.trim().toUpperCase().startsWith('NO')) {
    return { valid: false, reason: 'Out of scope' };
  }
 
  return { valid: true };
}
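`detectInjectionAttempt` is the keyword half of that check. A sketch of what the pattern list might look like; these patterns are illustrative, not a complete defense, which is why the classifier pass exists alongside it.

```typescript
// Illustrative patterns only; real coverage needs a classifier too
const INJECTION_PATTERNS: RegExp[] = [
  /ignore (all )?(previous|prior) instructions/i,
  /you are now (a|an) /i,
  /reveal your (system )?prompt/i,
  /disregard (the|your) (rules|instructions)/i,
];

function detectInjectionAttempt(input: string): boolean {
  return INJECTION_PATTERNS.some(p => p.test(input));
}
```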

Output Validation

Check agent responses before they reach the customer. This catches hallucinations, format violations, and policy breaches.

async function validateOutput(
  output: string,
  context: AgentState
): Promise<{ approved: boolean; issues: string[] }> {
  const issues: string[] = [];
 
  // Check for fabricated order numbers
  const orderNumbers = extractOrderNumbers(output);
  const knownOrders = extractOrderNumbers(
    JSON.stringify(context.toolCallHistory)
  );
  const fabricated = orderNumbers.filter(o => !knownOrders.includes(o));
  if (fabricated.length > 0) {
    issues.push(`Fabricated order numbers: ${fabricated.join(', ')}`);
  }
 
  // Check response length
  if (output.length > 2000) {
    issues.push('Response exceeds maximum length');
  }
 
  // Policy check — no promises about refund timelines, no sharing internal info
  const policyViolations = await checkPolicyCompliance(output);
  issues.push(...policyViolations);
 
  return { approved: issues.length === 0, issues };
}
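The fabrication check above hinges on `extractOrderNumbers`. A sketch, assuming order IDs look like `ORD-` followed by digits; match the regex to your real ID format or the check silently passes everything.

```typescript
// Assumed format: "ORD-" plus 6-10 digits. Adjust to your ID scheme.
function extractOrderNumbers(text: string): string[] {
  const matches = text.match(/ORD-\d{6,10}/g) ?? [];
  return [...new Set(matches)]; // dedupe repeated mentions
}
```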

Circuit Breakers

Circuit breakers stop the agent before damage occurs. They operate at the conversation level and at the system level.

| Circuit Breaker | Scope | Trigger | Action |
| --- | --- | --- | --- |
| Repetition detector | Conversation | 3 identical tool calls | Escalate to human |
| Token budget | Conversation | 50K tokens exceeded | Force completion |
| Error rate | System-wide | >20% of conversations escalating | Alert + reduce throughput |
| Cost spike | System-wide | Hourly cost >2x baseline | Pause new conversations |
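The system-wide error-rate breaker can be sketched as a sliding window over recent conversation outcomes. Window size is my assumption; the 20% threshold comes from the table above.

```typescript
// Trips when the escalation rate over the last N conversations exceeds
// the threshold. The caller wires tripped() to alerting and throttling.
class EscalationRateBreaker {
  private outcomes: boolean[] = []; // true = escalated

  constructor(
    private windowSize = 100,
    private maxEscalationRate = 0.2,
  ) {}

  record(escalated: boolean): void {
    this.outcomes.push(escalated);
    if (this.outcomes.length > this.windowSize) this.outcomes.shift();
  }

  tripped(): boolean {
    if (this.outcomes.length < this.windowSize) return false; // not enough data yet
    const rate = this.outcomes.filter(Boolean).length / this.outcomes.length;
    return rate > this.maxEscalationRate;
  }
}
```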

Observability: What to Log and How to Trace

You cannot debug a production agent without traces. I log every decision the agent makes, structured for querying.

interface AgentTraceEvent {
  traceId: string;
  conversationId: string;
  timestamp: number;
  eventType: 'plan' | 'tool_call' | 'tool_result' | 'observation' | 'response';
  model: string;
  tokensUsed: { input: number; output: number };
  latencyMs: number;
  data: Record<string, unknown>;
}
 
function logToolCall(
  traceId: string,
  toolCall: ToolCall,
  result: ToolResult<any>,
  latencyMs: number // measured around the tool call; ToolResult itself carries no timing
) {
  emit({
    traceId,
    conversationId: getConversationId(),
    timestamp: Date.now(),
    eventType: 'tool_call',
    model: 'n/a',
    tokensUsed: { input: 0, output: 0 },
    latencyMs,
    data: {
      toolName: toolCall.name,
      arguments: toolCall.arguments,
      resultStatus: result.status,
      resultSize: JSON.stringify(result).length
    }
  });
}

What I actually query for in production:

  • Conversations that escalated — why did the agent give up? What was the last tool call?
  • Conversations over 5 iterations — what caused the extra loops?
  • Tool failure rates — which tools are unreliable? What error patterns exist?
  • Token usage distribution — are some conversation types disproportionately expensive?
  • Latency outliers — which tool calls are slow? Is there a provider issue?

I use Langfuse for trace visualization, but the key is structuring your logs so you can answer these questions with any tool.


Cost Management: Token Budgets and Model Routing

Cost is a first-class concern for production agents. Here is how I keep it under control.

Model Routing

Not every LLM call needs the most capable model. I route based on the task:

| Task | Model | Reason |
| --- | --- | --- |
| Planning | Claude Haiku | Low reasoning requirement, high frequency |
| Tool selection and execution | Claude Sonnet | Needs accurate tool calling |
| Complex reasoning | Claude Opus | Rare, only for ambiguous cases |
| Input/output validation | Claude Haiku | Classification task, fast and cheap |

This routing reduced our average cost per conversation by 60% compared to using Sonnet for everything.
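The routing table collapses to a small function. The Haiku and Sonnet IDs match the ones used elsewhere in this post; the Opus ID is a placeholder, so substitute whatever model versions are current.

```typescript
type AgentTask = 'planning' | 'execution' | 'complex_reasoning' | 'validation';

// Map each task type to the cheapest model that handles it reliably
function routeModel(task: AgentTask): string {
  switch (task) {
    case 'planning':
    case 'validation':
      return 'claude-haiku-4-20250514';    // cheap, fast, high frequency
    case 'execution':
      return 'claude-sonnet-4-6-20250514'; // accurate tool calling
    case 'complex_reasoning':
      return 'claude-opus-4-20250514';     // placeholder ID; rare escalation path
  }
}
```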

Token Budgets

Every conversation has a token budget. The planner is aware of the remaining budget and adjusts its plan accordingly.

function adjustPlanForBudget(plan: Plan, remainingTokens: number): Plan {
  if (remainingTokens < 10000) {
    // Low budget: skip optional lookups, give direct answer
    return {
      ...plan,
      steps: plan.steps.filter(s => s.priority === 'required'),
      note: 'Budget constrained — essential actions only'
    };
  }
  return plan;
}

Monthly Cost Tracking

I track cost at three levels: per conversation, per customer, and per tool. This revealed that 5% of conversations consumed 40% of our token budget — all long, complex multi-order inquiries. We added a fast path for simple queries (order status, tracking number) that skips the planner entirely and uses a single Haiku call.
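The fast-path decision can be a cheap heuristic that runs before the planner. The patterns below are illustrative; the real list should come from your own query logs.

```typescript
// Single-intent queries a lone Haiku call can answer without planning
const FAST_PATH_PATTERNS: RegExp[] = [
  /where('s| is) my (order|package)/i,
  /tracking (number|info)/i,
  /order status/i,
  /has my order shipped/i,
];

function isFastPathQuery(input: string): boolean {
  // Short and matching a known simple intent: skip the planner entirely
  return input.length < 200 && FAST_PATH_PATTERNS.some(p => p.test(input));
}
```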


Evaluation: Is Your Agent Actually Working?

The hardest part of production agents is measuring whether they work. Unlike traditional software, there is no binary pass/fail. The agent might give a technically correct but unhelpful answer, or a helpful but slightly inaccurate one.

I evaluate on four axes:

| Metric | How Measured | Target |
| --- | --- | --- |
| Task completion rate | Did the agent resolve the issue without escalation? | >85% |
| Accuracy | Spot-check agent answers against ground truth | >95% |
| Customer satisfaction | Post-conversation survey | >4.2/5 |
| Cost per resolution | Total token cost / resolved conversations | Under $0.15 |

Automated Evaluation

I run a nightly eval suite against recorded conversations:

interface EvalCase {
  input: string;
  expectedTools: string[];      // tools that should be called
  expectedOutcome: 'resolved' | 'escalated';
  assertions: Array<{
    type: 'contains' | 'not_contains' | 'tool_called' | 'tool_not_called';
    value: string;
  }>;
}
 
async function runEval(cases: EvalCase[]): Promise<EvalReport> {
  const results = await Promise.all(
    cases.map(async (testCase) => {
      const response = await runProductionAgent(testCase.input, [], defaultConfig);
 
      const passed = testCase.assertions.every(assertion => {
        switch (assertion.type) {
          case 'contains':
            return response.content.includes(assertion.value);
          case 'not_contains':
            return !response.content.includes(assertion.value);
          case 'tool_called':
            return response.toolCallHistory.some(
              tc => tc.toolName === assertion.value
            );
          case 'tool_not_called':
            return !response.toolCallHistory.some(
              tc => tc.toolName === assertion.value
            );
        }
      });
 
      return { testCase, passed, response };
    })
  );
 
  return generateReport(results);
}

This catches regressions when prompts change, tools update, or model versions shift. It does not catch everything — some failures are only visible in real conversations — but it catches the obvious breaks before they reach customers.


What I Would Do Differently

If I started this project today, three things would change:

  1. Start with the observer pattern from day one. The planner-executor-observer architecture is not premature optimization. It is table stakes for production agents. I wasted time debugging the naive loop when I should have invested that time in proper architecture.

  2. Build the eval suite before the agent. Define what "working" means before you write the agent code. It is much harder to retrofit evaluation onto an agent that is already in production.

  3. Limit tools to 5-7 per agent. If your agent needs more tools, split it into multiple specialized agents with a router in front. One general-purpose agent with 15 tools will always be less reliable than three focused agents with 5 tools each.
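The router-plus-specialists idea in point 3 can be sketched as a thin classifier in front of focused agents. The agent names and keyword routing here are assumptions; in practice the router is often itself a small LLM classification call.

```typescript
type AgentName = 'orders' | 'returns' | 'shipping';

interface SpecializedAgent {
  name: AgentName;
  handles: RegExp[]; // cheap first-pass routing signals
}

// Order matters: more specific intents first
const agents: SpecializedAgent[] = [
  { name: 'returns',  handles: [/return/i, /refund/i] },
  { name: 'shipping', handles: [/ship/i, /track/i, /deliver/i] },
  { name: 'orders',   handles: [/order/i, /purchase/i] },
];

function routeToAgent(input: string): AgentName {
  for (const agent of agents) {
    if (agent.handles.some(p => p.test(input))) return agent.name;
  }
  return 'orders'; // default specialist; a real router might escalate instead
}
```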


Summary

Building a production AI agent is mostly not about the LLM. It is about everything around the LLM: guardrails, observability, cost controls, evaluation, and error handling. The model is the easy part. The system is the hard part.

The architecture that works: plan with a cheap model, execute with a capable model on filtered tools, observe with hard limits and circuit breakers. Log everything. Evaluate constantly. Treat token budgets as seriously as you treat compute budgets.

If your agent works in a demo but fails in production, the problem is almost certainly in the gap between "the model can do this" and "the system reliably does this at scale." That gap is where AI Engineering lives.

For a deeper look at the tool calling layer specifically, see my post on tool calling patterns for reliable AI agents. And if you are wondering where agent engineering fits in the broader AI Engineering landscape, start with the AI Engineering practitioner's guide.