Observability for AI Agents: What to Monitor When Your Code Calls an LLM

When a traditional API call fails, you check the status code, look at the error message, and fix it. When an LLM-powered agent produces a wrong answer or behaves unexpectedly, the problem could be anywhere: the prompt, the retrieved context, a tool call result, the model's reasoning, the output parser. Without observability, debugging is guesswork.

Traditional observability tooling, from APM suites like Datadog and New Relic down to basic logging, wasn't built for this. It can tell you the API call took 3.2 seconds, but not why the agent made the wrong decision, which tool call produced the bad data, or how much that one agent run cost across all its model calls.

This post covers what observability actually means for LLM systems and how to instrument production agents properly from day one.

What's Different About LLM Observability

Standard application observability tracks latency, error rate, and throughput. These still matter, but they're insufficient for LLM systems because:

Non-determinism. The same input can produce different outputs. You can't reproduce a bad output just by re-running the same request. You need the exact context window that produced it.

Multi-step execution. A single user request might involve 5-10 LLM calls and 15 tool calls, each building on the last. A problem in step 3 might not surface until step 9. You need trace-level visibility, not just request-level.

Cost is a first-class metric. Each LLM call has a direct financial cost (input tokens × price + output tokens × price). An agent with a memory management bug can cost 100× more than it should. You need per-run and per-user cost visibility.

Output quality is not binary. An agent that produces a wrong answer isn't a 500 error; it's a wrong answer that looks correct. You need a mechanism to evaluate and flag outputs, not just track whether the API call succeeded.

The Core Observability Stack

1. Distributed Tracing with LangFuse

LangFuse is purpose-built for LLM observability. It wraps your LLM calls and captures: the prompt, the completion, token counts, latency, cost, and the parent-child relationship between calls in a multi-step agent run.

Basic instrumentation:

import { Langfuse } from 'langfuse'
 
const langfuse = new Langfuse({
  secretKey: process.env.LANGFUSE_SECRET_KEY,
  publicKey: process.env.LANGFUSE_PUBLIC_KEY,
  baseUrl: 'https://cloud.langfuse.com',
})
 
async function runAgentTask(userId: string, task: string) {
  const trace = langfuse.trace({
    name: 'agent-task',
    userId,
    input: { task },
    metadata: { taskType: classifyTask(task) },
  })
 
  try {
    const result = await executeAgentLoop(task, trace)
 
    trace.update({
      output: { result },
      metadata: { outcome: 'success' },
    })
 
    return result
  } catch (error) {
    trace.update({
      metadata: { outcome: 'failure', error: String(error) },
    })
    throw error
  } finally {
    await langfuse.flushAsync()
  }
}

Each LLM call inside executeAgentLoop creates a child span on the trace, capturing the exact prompt and completion:

async function callLLM(
  messages: Message[],
  parentTrace: LangfuseTrace
): Promise<string> {
  const generation = parentTrace.generation({
    name: 'llm-call',
    model: 'gpt-4o',
    input: messages,
  })
 
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages,
  })
 
  const output = response.choices[0].message.content ?? ''
 
  generation.end({
    output,
    usage: {
      promptTokens: response.usage?.prompt_tokens,
      completionTokens: response.usage?.completion_tokens,
    },
  })
 
  return output
}

Now every agent run produces a full trace in LangFuse: which LLM calls happened, in what order, what was in each prompt, what the model returned, and how much it cost.

💡 Start Tracing Before Going Live

Retrofitting observability into an existing agent codebase is painful: you have to thread trace objects through every function call. Instrument from the start, even if you're not watching the traces yet. You'll thank yourself the first time something goes wrong in production.

2. Cost Tracking per User and Session

Token cost needs to be tracked at the user/session level, not just globally. An agent with a memory leak that's doubling context windows will look fine in aggregate metrics but will be burning $50/session for specific users.

interface RunMetrics {
  traceId: string
  userId: string
  totalInputTokens: number
  totalOutputTokens: number
  totalCostUsd: number
  llmCallCount: number
  toolCallCount: number
  durationMs: number
}
 
const MODEL_COSTS = {
  'gpt-4o': { input: 0.0025 / 1000, output: 0.01 / 1000 },
  'gpt-4o-mini': { input: 0.00015 / 1000, output: 0.0006 / 1000 },
  'claude-opus-4-6': { input: 0.015 / 1000, output: 0.075 / 1000 },
}
 
function calculateRunCost(
  model: keyof typeof MODEL_COSTS,
  inputTokens: number,
  outputTokens: number
): number {
  const costs = MODEL_COSTS[model]
  return costs.input * inputTokens + costs.output * outputTokens
}

Track these metrics per run, store them in your database, and set up alerts for any user session that exceeds a cost threshold.
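As a minimal sketch of that alerting step, a per-session accumulator might look like the following. The threshold value and the `console.warn` stand-in are illustrative assumptions; in production the total would come from your database and the alert would page or post to a channel.

```typescript
// Illustrative per-session cost guard. COST_ALERT_THRESHOLD_USD is an
// arbitrary example value, and console.warn stands in for real alerting.
const COST_ALERT_THRESHOLD_USD = 5

const sessionCosts = new Map<string, number>()

function recordRunCost(sessionId: string, costUsd: number): number {
  // Accumulate cost across all runs in the session
  const total = (sessionCosts.get(sessionId) ?? 0) + costUsd
  sessionCosts.set(sessionId, total)

  if (total > COST_ALERT_THRESHOLD_USD) {
    // In production, page or notify instead of logging
    console.warn(`Session ${sessionId} exceeded cost threshold: $${total.toFixed(2)}`)
  }

  return total
}
```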

3. Tool Call Success Rates

Tool calls fail for different reasons than LLM calls: network errors, authentication failures, rate limits on external APIs, invalid inputs from the model. Track tool call success rates separately from LLM call success rates.

async function trackedToolCall<T>(
  toolName: string,
  toolFn: () => Promise<T>,
  parentTrace: LangfuseTrace
): Promise<T> {
  const span = parentTrace.span({
    name: `tool:${toolName}`,
    startTime: new Date(),
  })
 
  const startTime = Date.now()
 
  try {
    const result = await toolFn()
 
    span.end({
      output: { success: true },
      metadata: { durationMs: Date.now() - startTime },
    })
 
    return result
  } catch (error) {
    span.end({
      output: { success: false, error: String(error) },
      level: 'ERROR',
      metadata: { durationMs: Date.now() - startTime },
    })
 
    throw error
  }
}

If search_web has a 40% failure rate and query_database has a 2% failure rate, you want to see that distinction, not just an aggregate agent failure rate.
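A sketch of that per-tool breakdown, using in-process counters for illustration (in a real deployment these numbers would come from your tracing backend, not app memory):

```typescript
// In-process counters for per-tool failure rates. Illustrative only:
// production numbers should come from the tracing backend.
interface ToolStats {
  calls: number
  failures: number
}

const toolStats = new Map<string, ToolStats>()

function recordToolResult(toolName: string, success: boolean): void {
  const stats = toolStats.get(toolName) ?? { calls: 0, failures: 0 }
  stats.calls += 1
  if (!success) stats.failures += 1
  toolStats.set(toolName, stats)
}

function toolFailureRate(toolName: string): number {
  const stats = toolStats.get(toolName)
  // No recorded calls means no observed failures
  return !stats || stats.calls === 0 ? 0 : stats.failures / stats.calls
}
```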

4. Output Quality Evaluation

The hardest observability problem: how do you know if an agent's output was actually correct? There are three practical approaches:

Human review sampling. Flag a percentage of outputs for human review. Even 2-5% sampling gives you signal on output quality trends over time.
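One way to implement the sampling gate, sketched here with a deterministic hash of the trace ID so a given run is always in or out of the review set (the 2% rate and the hash function are illustrative choices, not a prescribed scheme):

```typescript
// Deterministic review sampling: hash the trace ID into [0, 1) and compare
// against the sample rate. REVIEW_SAMPLE_RATE is an example value.
const REVIEW_SAMPLE_RATE = 0.02

function hashToUnitInterval(id: string): number {
  let h = 0
  for (let i = 0; i < id.length; i++) {
    // Simple 32-bit polynomial rolling hash
    h = (h * 31 + id.charCodeAt(i)) >>> 0
  }
  return h / 0x100000000 // divide by 2^32 to land in [0, 1)
}

function shouldSampleForReview(traceId: string): boolean {
  return hashToUnitInterval(traceId) < REVIEW_SAMPLE_RATE
}
```

Hashing instead of `Math.random()` means the decision is reproducible: re-processing the same trace never flips it in or out of the review queue.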

LLM-as-judge. Use a secondary LLM call to evaluate the output against criteria you define:

import { z } from 'zod'
import { zodResponseFormat } from 'openai/helpers/zod'
 
async function evaluateOutput(
  task: string,
  output: string
): Promise<{ score: number; issues: string[] }> {
  // zodResponseFormat pairs with the .parse() helper (not .create()),
  // which is what populates message.parsed
  const response = await openai.beta.chat.completions.parse({
    model: 'gpt-4o-mini', // cheaper model for evaluation
    messages: [
      {
        role: 'system',
        content: 'You are a quality evaluator. Score outputs 0-1 and list issues.',
      },
      {
        role: 'user',
        content: `Task: ${task}\nOutput: ${output}\n\nEvaluate the output quality.`,
      },
    ],
    response_format: zodResponseFormat(
      z.object({
        score: z.number().min(0).max(1),
        issues: z.array(z.string()),
      }),
      'evaluation'
    ),
  })
 
  return response.choices[0].message.parsed!
}

User feedback signals. If users can thumbs-up/down agent outputs, that feedback is your most reliable quality signal. Route it back to LangFuse as evaluation scores.
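A minimal sketch of that routing step. The `ScoreClient` interface here is a stand-in for illustration; its `score` method mirrors the shape of the Langfuse JS SDK's `langfuse.score({ traceId, name, value, comment })`:

```typescript
// ScoreClient is a stand-in for the Langfuse client, mirroring the shape of
// langfuse.score({ traceId, name, value, comment }) from the JS SDK.
interface ScoreClient {
  score(args: { traceId: string; name: string; value: number; comment?: string }): void
}

function recordUserFeedback(
  client: ScoreClient,
  traceId: string,
  thumbsUp: boolean,
  comment?: string
): void {
  client.score({
    traceId,
    name: 'user-feedback',
    value: thumbsUp ? 1 : 0, // 1 = thumbs-up, 0 = thumbs-down
    comment,
  })
}
```

Attaching feedback to the trace ID, rather than storing it separately, lets you jump from a thumbs-down straight to the full prompt-and-completion history that produced it.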

The Minimum Viable Dashboard

When you're starting out, focus on these five metrics before anything else:

Metric | Why It Matters
P95 latency per agent type | Identifies which agents are slow
Cost per successful run | Catches memory/context bugs early
Tool call failure rate by tool | Pinpoints reliability bottlenecks
LLM call count per run | Detects runaway agent loops
Output quality score (sampled) | The metric that actually matters

LangFuse has built-in dashboards for the first four. For output quality, use the evaluation API to push scores back after human or LLM-based review.


The overhead of instrumenting observability from day one is low. The cost of not having it, when something goes wrong at 2am and you're debugging a production agent with no visibility into what it was doing, is much higher. Build it in before you need it.