Lecture 4.2: Context Management Strategies

Module 4, Lecture 4.2 | Section 2: Working with LLMs in Practice

Overview

Lecture 4.1 established the measurement foundation: how to count tokens and when to act (at ~80% of the model's context window). This lecture covers what to do when that threshold is crossed. Three strategies are available, ranging from a one-line implementation to a full LLM-powered summarization. Most production agents use a blend of all three.

Strategy 1: Sliding Window

The simplest approach: always keep the system prompt and the last N messages. Drop everything older.

def sliding_window(messages, system, max_messages=20):
    return [system] + messages[-max_messages:]

The sliding window has three properties that make it attractive for simple use cases:

No LLM required — the operation is pure Python, with no extra API calls
Preserves recency — the model always sees the most recent context
Predictable cost — context size stays bounded at max_messages

Context window sliding

The failure mode is equally clear: anything before the window is permanently lost. For tasks where users state their goals at the start of a conversation, or where the agent established a multi-step plan early on, dropping the earliest messages can destroy coherence. The agent no longer knows what it agreed to do or what decisions were made.

Good fit: Short tasks, stateless interactions (Q&A, simple customer service bots), applications where recent messages are the only relevant ones, cost-sensitive applications where extra API calls are not acceptable.

Poor fit: Multi-step tasks where early context contains the plan; any conversation where the user stated an objective at the start that needs to govern the entire session.

Sliding window is most appropriate for chatbot-style interactions. For agents that read files, run commands, and execute multi-step plans, it is too lossy.

Strategy 2: Selective Preservation

Rather than assuming recency equals importance, selective preservation classifies messages by their value and keeps only the meaningful ones.

The core intuition: not all messages are equal. A user message saying "investigate this bug" carries less long-term value than the agent's discovery that the database schema has a column name mismatch. A raw file listing that was read and acted on three steps ago carries little value. A decision about the approach to take carries a great deal.

What to keep and what to drop

Keep:

System prompt (always — it defines the agent's identity and tools)
Key decisions — the agent committed to a plan, the user confirmed a direction
Current goals — what the agent is trying to accomplish right now
Recent messages — the last few exchanges for conversational continuity

Drop:

Old tool results — file contents that have already been read and acted on (the file can be re-read if needed)
Superseded plans — the agent proposed approach A, then switched to approach B; approach A can be discarded
Redundant exchanges — the same file read twice, repeated clarifications
Intermediate reasoning — chain-of-thought that led to a decision (keep the decision, drop the path)

Implementation approaches

The simplest form uses role-based rules and recency together:

def selective_preserve(messages, system, keep_last=10):
    important = [m for m in messages[:-keep_last]
                 if m.get("important") or m["role"] == "system"]
    recent = messages[-keep_last:]
    return [system] + important + recent

This keeps all messages flagged as important (regardless of age) plus the most recent exchanges. The flagging can be done by:

Role-based rules — always keep tool-call decisions, never keep raw tool results after N turns
Recency + type — keep the last 10 messages verbatim, keep all messages tagged as "decision" regardless of age
LLM classification — route each new message through a secondary LLM call (using a fast, cheap model like Haiku) that labels it as important or not

The LLM classification approach adds cost — one extra API call per message — but enables nuanced judgments. An LLM can recognize that a message contains a key constraint ("the user wants us to avoid modifying the database schema") even if it was stated casually. It can also identify when a message supersedes an earlier one.

Selective preservation is more powerful than a sliding window because it decouples importance from age. It is more expensive than a sliding window because, when using LLM-based classification, it requires extra API calls.

Strategy 3: Compaction

Compaction is the most powerful strategy: use the LLM itself to summarize the full conversation, then replace the entire conversation history with the summary.

The mechanism:

The conversation reaches the budget threshold (e.g., 80% of the model's context window)
Send the full conversation to the LLM with a summarization prompt
Receive a compact summary that preserves essential information
Replace the conversation history with the summary as a single message
Continue as if the summary were the entire prior context

The trade-off is explicit: you spend tokens now (the summarization call) to save tokens later (all future calls use the compact summary instead of the full history). A conversation that grew to 50,000 tokens can be compacted to 2,000 tokens with well-designed prompts — a 96% reduction in future input token cost.

Compaction flow

What a good compaction prompt preserves

The quality of compaction depends almost entirely on the summarization prompt. A good prompt tells the LLM exactly what to keep and what to discard:

Preserve:

Decisions made and actions taken
Current state of the task — what is complete, what is in progress, what remains
Key facts about the domain (project languages, database schema details, file paths)
Unresolved issues or follow-up items
User preferences and constraints established during the session

Discard:

Raw file contents that were read and already acted on
Full command outputs (summarize the result instead: "all 5 tests passed")
The assistant's reasoning chain (keep the conclusion, not the path to it)
Intermediate tool calls (note the outcome, not the transcript)
Superseded plans and redundant exchanges

The compaction_demo.py script demonstrates this pattern on a realistic debugging conversation. The original 18-message conversation — simulating an agent that read source files, examined logs, identified a column name mismatch, applied fixes across multiple files, and ran tests — contains roughly 1,100 tokens. After compaction, the summary is 293 tokens. Context reduced by ~75%.

The key fragment from the demo:

def compact_conversation(messages):
    response = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": COMPACTION_PROMPT + format_conversation(messages)
        }]
    )
    summary = response.content[0].text
    return [
        {"role": "user", "content": f"[Prior conversation summary]\n{summary}"},
        {"role": "assistant", "content": "Understood. I have the context from the prior conversation and I'm ready to continue."}
    ]

The compacted conversation is two messages: the summary injected as a user message, and an assistant acknowledgment. All subsequent turns build from this new baseline. The full script is in compaction_demo.py.

Compaction in production: OpenClaw

OpenClaw is an open-source general-purpose agent (~247K GitHub stars) that demonstrates compaction at production scale. OpenClaw manages long-running conversations — days or weeks of interactions — where a user sends requests to an agent running on their personal computer: check email, review calendar, manage documents.

When OpenClaw's context approaches its limit, it triggers a silent turn — an internal LLM call that is not visible to the user. The silent turn asks the LLM to summarize the conversation and extract the most important facts into durable memory (Markdown files written to disk). The conversation restarts with a compact summary as context. The durable memory files survive reboots.

From the user's perspective, nothing happened. The agent simply keeps working. The silent turn is invisible.

This architecture demonstrates the "summarize and restart" pattern scaled to a real product. The specific compaction logic varies — OpenClaw uses local models, writes memory to files, and maintains an index — but the core idea is the same pattern demonstrated in the demo script.

Choosing a Strategy

	Sliding Window	Selective Preservation	Compaction
Simplicity	One line of code	Moderate — needs classification	Complex — extra API call
Information loss	Drops everything old	Keeps tagged items	LLM decides what matters
Cost	Free	Free (rule-based) or cheap (LLM)	Costs a summarization call
Best for	Short/stateless tasks	Multi-step with key milestones	Long-running complex tasks

In practice, production agents rarely use a single strategy in isolation. A common pattern:

Use selective preservation as a baseline — always keep the system prompt, drop old tool results after they are acted on, keep flagged decisions regardless of age
Trigger compaction when the context crosses the budget threshold
Keep the last few messages verbatim alongside the summary, as a hedge against compaction losing recent context

The sliding window is most useful when the agent's task structure guarantees that older messages are genuinely irrelevant — stateless Q&A, simple chatbots, scenarios where each turn is largely independent.

The right strategy depends on what the agent is doing. A general-purpose agent like OpenClaw that manages your calendar, email, and files over weeks demands different context management than a single-session coding assistant. Matching the strategy to the task structure is the core engineering judgment.

Key Takeaways

Sliding window — keep the last N messages; simple, free, and lossy. Works for stateless tasks; fails for multi-step agents.
Selective preservation — keep what matters based on importance, not age. Requires classifying messages, but can be done with simple rules or an LLM.
Compaction — use the LLM to summarize the conversation, then restart. Most powerful; costs a summarization call but achieves large token reductions.
Production agents blend strategies — OpenClaw's auto-compaction pattern (silent turn → summary → restart) is the standard for long-running agents.