
Building Agent Memory Systems: Short-Term Context vs Long-Term Persistence


The context window is not memory. It's working memory: what the agent has in front of it right now. Real memory is what the agent can access across calls, sessions, and time. Getting this distinction right is one of the most important architectural decisions you'll make when building production agents.

Most agent implementations I've seen treat the context window as the only memory store. That works until it doesn't, and the failure modes are expensive: bloated context windows, degraded reasoning quality, escalating token costs, and agents that "forget" relevant information because it fell out of the window.

This post covers the three tiers of agent memory I've found most useful, with concrete implementation patterns for each.

The Three Memory Tiers

Tier 1: Working Memory (The Context Window)

Working memory is what's actively in the LLM's context right now. It includes the system prompt, current conversation history, tool call results, and any retrieved context.

Working memory is fast and immediately accessible: the model can reason over everything in it without any retrieval step. But it's also bounded (32k, 128k, or 200k tokens, depending on your model), expensive to fill, and ephemeral: it disappears at the end of the API call.

What belongs in working memory:

  • The current task and its immediate sub-tasks
  • The last 3-5 tool call results (not all history)
  • Relevant retrieved facts (injected from longer-term storage)
  • The agent's current reasoning trace

What does not belong in working memory:

  • Full conversation history from previous sessions
  • All available tools (inject only the tools relevant to the current step)
  • Reference documents longer than necessary for the current task

The goal is to keep working memory dense with relevant information and sparse with everything else.
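One concrete way to enforce the "last 3-5 tool call results" guideline is to prune tool history before each model call. This is a minimal sketch; `ToolResult` and `trimToolHistory` are illustrative names, not from any particular framework:

```typescript
interface ToolResult {
  tool: string
  output: string
}

// Keep only the most recent `keep` tool results verbatim.
// Older results are assumed to be summarized or dropped elsewhere.
function trimToolHistory(results: ToolResult[], keep = 5): ToolResult[] {
  return results.slice(-keep)
}
```

In practice you'd run this (or a token-budget-aware variant) every loop iteration, so the tool-call history can never silently crowd out the task itself.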

Tier 2: Episodic Memory (Session and User History)

Episodic memory is the record of past agent runs: what happened, what the outcomes were, and what context existed at the time. You can't fit this in a context window for anything but the shortest sessions, so it lives externally and gets retrieved when relevant.

Implementation approach: semantic search over stored episodes

interface Episode {
  id: string
  agentRunId: string
  userId: string
  timestamp: Date
  summary: string          // LLM-generated summary of the run
  outcome: 'success' | 'failure' | 'partial'
  keyDecisions: string[]   // Important choices the agent made
  embedding: number[]      // Vector embedding of the summary
}
 
async function retrieveRelevantEpisodes(
  currentTask: string,
  userId: string,
  limit = 3
): Promise<Episode[]> {
  // embedText and vectorStore are app-specific helpers: an embedding
  // API call and a vector database client, respectively.
  const taskEmbedding = await embedText(currentTask)
 
  return vectorStore.query({
    vector: taskEmbedding,
    filter: { userId },   // only this user's episodes
    topK: limit,
    minScore: 0.75,       // drop weakly related matches
  })
}

When the agent starts a new run, you retrieve the most semantically similar past episodes for that user and inject summaries into the working memory. The agent now has access to relevant history without the context window overhead of the full transcripts.
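The injection step can be as simple as formatting the retrieved summaries into a prompt section. A sketch (`formatEpisodesForPrompt` and the section layout are assumptions, not part of any library):

```typescript
interface EpisodeSummary {
  timestamp: string
  outcome: 'success' | 'failure' | 'partial'
  summary: string
}

// Turn retrieved episodes into a compact block for the system prompt.
function formatEpisodesForPrompt(episodes: EpisodeSummary[]): string {
  if (episodes.length === 0) return ''
  const lines = episodes.map(
    e => `- [${e.timestamp}] (${e.outcome}) ${e.summary}`
  )
  return `## Relevant Past Episodes\n${lines.join('\n')}`
}
```

Returning an empty string for no matches matters: injecting an empty "Relevant Past Episodes" header would just waste tokens and invite the model to hallucinate history.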

💡Summarize Before Storing

Don't store raw conversation transcripts in your episodic memory store. Use the LLM to generate a structured summary at the end of each run: key decisions, outcome, important context. The summary is what you retrieve later, and a good summary is more useful than a full transcript anyway.

Tier 3: Procedural Memory (Patterns and Instructions)

Procedural memory is the agent's knowledge of how to do things: task-specific instructions, learned patterns, and operational knowledge that applies across many runs.

This is the least discussed tier but often the most impactful. Instead of encoding all procedural knowledge in the system prompt (which grows unbounded), you store it externally and retrieve the relevant slice for each task type.

interface Procedure {
  id: string
  taskType: string        // 'code_review', 'data_extraction', 'email_draft'
  title: string
  instructions: string    // Step-by-step guidance
  examples: Example[]     // Few-shot examples for this task type
  lastUpdated: Date
  successRate: number     // Tracked from production runs
}
 
async function buildSystemPrompt(
  taskType: string,
  basePrompt: string
): Promise<string> {
  const procedure = await procedures.findByTaskType(taskType)
 
  if (!procedure) return basePrompt
 
  return `${basePrompt}
 
## Task-Specific Instructions: ${procedure.title}
 
${procedure.instructions}
 
## Examples
${procedure.examples.map(e => `Input: ${e.input}\nOutput: ${e.output}`).join('\n\n')}`
}

The key property of procedural memory is that it can improve over time. When an agent run succeeds with a particularly good approach, you can update the stored procedure. When a pattern consistently fails, you can fix it centrally rather than hunting through prompts scattered across your codebase.
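One way to make the `successRate` field actually track production outcomes is an exponential moving average over run results. A sketch, assuming the update happens at the end of each run; the `0.1` weight and the `updateSuccessRate` name are my assumptions, not from the post:

```typescript
// Blend the latest run outcome into the stored success rate.
// alpha controls how quickly recent outcomes dominate old ones.
function updateSuccessRate(
  current: number,
  succeeded: boolean,
  alpha = 0.1
): number {
  const observation = succeeded ? 1 : 0
  return current * (1 - alpha) + observation * alpha
}
```

A moving average rather than a raw ratio means a procedure that was fixed last month isn't forever penalized by its early failures, which is exactly the "improve over time" property you want.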

Memory in Practice: A Concrete Flow

Here's how these three tiers work together in a real agent run:

1. Agent receives task
2. Retrieve relevant episodic memory (past similar runs for this user)
3. Retrieve relevant procedural memory (instructions for this task type)
4. Build context window:
   - Base system prompt
   - Injected procedural instructions
   - Summaries of 2-3 relevant past episodes
   - Current task
5. Execute agent loop (tool calls, reasoning, etc.)
6. On completion:
   - Generate episode summary
   - Store episode with embedding
   - Update procedural memory if outcome warrants it

The context window never grows unbounded. It's constructed deliberately from external stores, sized to what's needed for the current task.
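The assembly in step 4 can be sketched as a pure function over the retrieved pieces. `assembleContext` and the section headings are illustrative choices, not a prescribed format:

```typescript
// Deliberately construct the context window from external stores:
// base prompt + procedural instructions + episode summaries + task.
function assembleContext(
  basePrompt: string,
  proceduralInstructions: string,
  episodeSummaries: string[],
  currentTask: string
): string {
  const parts = [basePrompt]
  if (proceduralInstructions) parts.push(proceduralInstructions)
  if (episodeSummaries.length > 0) {
    parts.push(`## Relevant History\n${episodeSummaries.join('\n\n')}`)
  }
  parts.push(`## Current Task\n${currentTask}`)
  return parts.join('\n\n')
}
```

Because each section is optional except the base prompt and the task, the context stays minimal when there's nothing relevant to inject.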

Choosing a Vector Store

For episodic and procedural memory, you need a vector database that can store embeddings and metadata, and support filtered semantic search (by user ID, task type, date range, etc.).

Options I've used in production:

  • pgvector (PostgreSQL extension): Good choice if you're already on Postgres. Familiar operational model, SQL-native filtering, no additional infrastructure. Adequate performance for most use cases.
  • Pinecone: Managed service, fast at scale, simple API. Good for teams that don't want to operate their own vector infrastructure.
  • Qdrant: Self-hosted or managed, good filtering support, performant. Open source.
  • Weaviate: Good if you need hybrid search (vector + keyword).

For most applications starting out, pgvector is the pragmatic choice. You can always migrate to a dedicated vector store if scale requires it.
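With pgvector, the filtered semantic search above maps to ordinary SQL. A sketch of the query shape (table and column names are assumptions; `<=>` is pgvector's cosine-distance operator):

```typescript
// Build a parameterized pgvector query: $1 = user id, $2 = task embedding.
// topK is interpolated directly since it's a number, not user input.
function episodeQuerySql(topK: number): string {
  return `
SELECT id, summary, outcome
FROM episodes
WHERE user_id = $1
ORDER BY embedding <=> $2
LIMIT ${topK}`.trim()
}
```

The metadata filter is just a `WHERE` clause, which is the main operational appeal: no second system to learn for filtering logic.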

Common Mistakes

Storing raw transcripts instead of summaries. Retrieval quality degrades when you're searching over thousands of tokens per episode. Summarize at write time.

Using the same embedding model for all content types. Embeddings for technical code reviews are in a different semantic space than embeddings for conversational messages. Consider task-specific embedding strategies.

No memory expiration policy. Old episodes become noise over time. Implement TTL policies or relevance decay so the retrieval results stay meaningful.
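Relevance decay can be as simple as down-weighting similarity scores by episode age at rerank time. A sketch; the exponential form and the 90-day half-life are arbitrary assumptions to tune for your domain:

```typescript
// Halve an episode's effective score every halfLifeDays of age,
// so old episodes gradually lose out to fresher, equally similar ones.
function decayedScore(
  similarity: number,
  ageDays: number,
  halfLifeDays = 90
): number {
  return similarity * Math.pow(0.5, ageDays / halfLifeDays)
}
```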

Treating memory as append-only. Procedural memory especially should be updatable. Build in mechanisms to correct and improve stored procedures based on production outcomes.


Memory architecture is where most agent projects accumulate technical debt because it feels like an optimization rather than a core requirement. Build the tiers deliberately from the start. Your context windows will be smaller, your token costs will be lower, and your agents will behave more consistently even as the number of users and sessions scales up.