Module 4, Lecture 4.2 | Section 2: Working with LLMs in Practice
Lecture 4.1 established the measurement foundation: how to count tokens and when to act (at ~80% of the model's context window). This lecture covers what to do when that threshold is crossed. Three strategies are available, ranging from a one-line implementation to a full LLM-powered summarization. Most production agents use a blend of all three.
The simplest approach: always keep the system prompt and the last N messages. Drop everything older.
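def sliding_window(messages, system, max_messages=20):
    # Keep the system prompt plus the most recent max_messages messages.
    # Assumes max_messages >= 1; messages[-0:] would return the whole list.
    return [system] + messages[-max_messages:]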
The sliding window has three properties that make it attractive for simple use cases:

The failure mode is equally clear: anything before the window is permanently lost. For tasks where users state their goals at the start of a conversation, or where the agent established a multi-step plan early on, dropping the earliest messages can destroy coherence. The agent no longer knows what it agreed to do or what decisions were made.
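The failure mode can be reproduced in a few lines. The following is a minimal sketch, assuming a made-up 30-message conversation in which the user's goal appears only in the first message; the message contents are hypothetical.

```python
def sliding_window(messages, system, max_messages=20):
    """Keep the system prompt plus the last `max_messages` messages."""
    return [system] + messages[-max_messages:]

system = {"role": "system", "content": "You are a helpful assistant."}

# Hypothetical conversation: the goal is stated once, at the very start.
messages = [{"role": "user", "content": "Goal: migrate the billing service to Postgres."}]
for i in range(1, 30):
    role = "assistant" if i % 2 else "user"
    messages.append({"role": role, "content": f"step {i}"})

window = sliding_window(messages, system, max_messages=20)

# Check whether the opening goal message is still in the context.
goal_survived = any("Goal:" in m["content"] for m in window)
print(len(window), goal_survived)  # 21 False
```

The window holds the system prompt and the 20 newest messages, so the goal stated in message one is gone and nothing in the remaining context tells the agent what it was working toward.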
Good fit: Short tasks, stateless interactions (Q&A, simple customer service bots), applications where recent messages are the only relevant ones, cost-sensitive applications where extra API calls are not acceptable.
Poor fit: Multi-step tasks where early context contains the plan; any conversation where the user stated an objective at the start that needs to govern the entire session.
Sliding window is most appropriate for chatbot-style interactions. For agents that read files, run commands, and execute multi-step plans, it is too lossy.
Rather than assuming recency equals importance, selective preservation classifies messages by their value and keeps only the meaningful ones.
The core intuition: not all messages are equal. A user message saying "investigate this bug" carries less long-term value than the agent's discovery that the database schema has a column name mismatch. A raw file listing that was read and acted on three steps ago carries little value. A decision about the approach to take carries a great deal.
Keep:
Drop:
The simplest form uses role-based rules and recency together:
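def selective_preserve(messages, system, keep_last=10):
    # Split explicitly: with keep_last=0, messages[:-0] is empty and messages[-0:] is the whole list.
    split = max(len(messages) - keep_last, 0)
    older, recent = messages[:split], messages[split:]
    # Keep older messages only if they were flagged as important.
    important = [m for m in older
                 if m.get("important") or m["role"] == "system"]
    return [system] + important + recent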
This keeps all messages flagged as important (regardless of age) plus the most recent exchanges. The flagging can be done by:
The LLM classification approach adds cost — one extra API call per message — but enables nuanced judgments. An LLM can recognize that a message contains a key constraint ("the user wants us to avoid modifying the database schema") even if it was stated casually. It can also identify when a message supersedes an earlier one.
Selective preservation is more powerful than a sliding window because it decouples importance from age. It costs more than a sliding window only when classification is LLM-based, since each classification requires an extra API call; rule-based flagging adds no API cost.
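As an illustration of the rule-based end of that spectrum, here is a minimal sketch of keyword flagging feeding selective preservation. The pattern list, the `flag_important` helper, and the sample messages are all hypothetical; a real agent would tune the rules to its own domain or replace them with an LLM classifier.

```python
import re

# Hypothetical heuristics: phrases that tend to signal decisions or constraints.
IMPORTANT_PATTERNS = [
    r"\bdecided\b", r"\bdecision\b", r"\bmust not\b", r"\bdo not\b",
    r"\bavoid\b", r"\broot cause\b", r"\bplan:",
]

def flag_important(message):
    """Return a copy of the message with an 'important' flag set by keyword rules."""
    text = message["content"].lower()
    hit = any(re.search(p, text) for p in IMPORTANT_PATTERNS)
    return {**message, "important": hit}

def selective_preserve(messages, system, keep_last=10):
    """Keep flagged older messages plus the last `keep_last` messages."""
    split = max(len(messages) - keep_last, 0)
    older, recent = messages[:split], messages[split:]
    important = [m for m in older if m.get("important")]
    return [system] + important + recent

system = {"role": "system", "content": "You are a debugging agent."}
history = [
    {"role": "user", "content": "Please avoid modifying the database schema."},
    {"role": "assistant", "content": "Listing files: a.py, b.py, c.py"},
    {"role": "assistant", "content": "Root cause: column name mismatch in orders table."},
    {"role": "user", "content": "ok"},
    {"role": "assistant", "content": "Running tests now."},
]

flagged = [flag_important(m) for m in history]
kept = selective_preserve(flagged, system, keep_last=2)
for m in kept:
    print(m["role"], "|", m["content"])
```

With `keep_last=2`, the constraint and the root-cause finding survive even though they fall outside the recent window, while the file listing is dropped. Keyword rules are cheap but brittle: a constraint phrased in words the patterns do not cover would be missed.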
Compaction is the most powerful strategy: use the LLM itself to summarize the full conversation, then replace the entire conversation history with the summary.
The mechanism:
The trade-off is explicit: you spend tokens now (the summarization call) to save tokens later (all future calls use the compact summary instead of the full history). A conversation that grew to 50,000 tokens can be compacted to 2,000 tokens with well-designed prompts — a 96% reduction in future input token cost.
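The arithmetic behind this trade-off can be worked through directly. The sketch below uses the 50,000 and 2,000 token figures from the text; the assumption that an output token costs five times an input token is illustrative only, and real pricing varies by model.

```python
import math

history_tokens = 50_000   # figure from the text
summary_tokens = 2_000    # figure from the text
output_price_ratio = 5    # assumption: an output token costs 5x an input token

# Fraction of input tokens removed from every later call.
reduction = 1 - summary_tokens / history_tokens

# One-time cost of the summarization call, in input-token equivalents:
# it reads the full history and writes the summary.
compaction_cost = history_tokens + summary_tokens * output_price_ratio

# Input tokens saved on each later call that carries the summary instead.
saved_per_call = history_tokens - summary_tokens

# Number of later calls needed before the summarization call has paid for itself.
break_even_calls = math.ceil(compaction_cost / saved_per_call)

print(f"reduction: {reduction:.0%}")
print(f"compaction cost: {compaction_cost} token-equivalents")
print(f"break-even after {break_even_calls} later calls")
```

Under these assumptions the summarization call pays for itself within two subsequent calls. The estimate ignores prompt caching and the fact that an uncompacted history would keep growing, so it is a rough guide and not a billing model.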

The quality of compaction depends almost entirely on the summarization prompt. A good prompt tells the LLM exactly what to keep and what to discard:
Preserve:
Discard:
The compaction_demo.py script demonstrates this pattern on a realistic debugging conversation. The original 18-message conversation simulates an agent that read source files, examined logs, identified a column name mismatch, applied fixes across multiple files, and ran tests; it contains roughly 1,100 tokens. After compaction, the summary is 293 tokens, a context reduction of roughly 75%.
The key fragment from the demo:
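# client, MODEL, COMPACTION_PROMPT, and format_conversation are defined
# elsewhere in the full script.
def compact_conversation(messages):
    # Ask the model to summarize the whole conversation in a single call.
    response = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": COMPACTION_PROMPT + format_conversation(messages)
        }]
    )
    summary = response.content[0].text
    # Replace the history with a summary message and an acknowledgment.
    return [
        {"role": "user", "content": f"[Prior conversation summary]\n{summary}"},
        {"role": "assistant", "content": "Understood. I have the context from the prior conversation and I'm ready to continue."}
    ]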
The compacted conversation is two messages: the summary injected as a user message, and an assistant acknowledgment. All subsequent turns build from this new baseline. The full script is in compaction_demo.py.
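The fragment relies on `COMPACTION_PROMPT` and `format_conversation`, which are not shown here. The following is a hypothetical stand-in for both, written only to illustrate the shape they might take; the prompt wording and the transcript format are assumptions, not the contents of compaction_demo.py.

```python
# Hypothetical stand-ins; the real definitions live in compaction_demo.py.
COMPACTION_PROMPT = (
    "Summarize the conversation below so that work can continue from the summary alone.\n"
    "Preserve: the user's goal, constraints, decisions made, key findings, "
    "files changed, and remaining steps.\n"
    "Discard: raw file listings, verbose tool output, and pleasantries.\n\n"
    "Conversation:\n"
)

def format_conversation(messages):
    """Render messages as a plain-text transcript, one 'ROLE: content' entry per message."""
    lines = []
    for m in messages:
        content = m["content"]
        if not isinstance(content, str):
            # Assumption: non-string content (e.g. tool blocks) is stringified as-is.
            content = str(content)
        lines.append(f"{m['role'].upper()}: {content}")
    return "\n\n".join(lines)

demo = [
    {"role": "user", "content": "Tests fail with 'no such column: user_name'."},
    {"role": "assistant", "content": "The schema uses 'username'; I will update the query."},
]

prompt = COMPACTION_PROMPT + format_conversation(demo)
print(prompt)
```

Flattening the history into one transcript string keeps the summarization request to a single user message, which matches how the fragment builds its `messages` argument.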
OpenClaw is an open-source general-purpose agent (~247K GitHub stars) that demonstrates compaction at production scale. OpenClaw manages long-running conversations — days or weeks of interactions — where a user sends requests to an agent running on their personal computer: check email, review calendar, manage documents.
When OpenClaw's context approaches its limit, it triggers a silent turn — an internal LLM call that is not visible to the user. The silent turn asks the LLM to summarize the conversation and extract the most important facts into durable memory (Markdown files written to disk). The conversation restarts with a compact summary as context. The durable memory files survive reboots.
From the user's perspective, nothing happened: the agent simply keeps working.
This architecture demonstrates the "summarize and restart" pattern scaled to a real product. The specific compaction logic varies — OpenClaw uses local models, writes memory to files, and maintains an index — but the core idea is the same pattern demonstrated in the demo script.
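A silent turn with durable memory can be sketched in miniature. This is not OpenClaw's code: the function name, the `MEMORY.md` file name, and the injected `summarize` callable are all hypothetical, and the LLM call is replaced with a stub so the example runs offline.

```python
from pathlib import Path
import tempfile

def silent_compaction_turn(messages, summarize, memory_path):
    """Summarize the history, append it to a Markdown memory file, and restart the context.

    `summarize` is any callable mapping a message list to a summary string;
    in a real agent it would wrap an LLM call.
    """
    summary = summarize(messages)
    memory_path = Path(memory_path)
    memory_path.parent.mkdir(parents=True, exist_ok=True)
    # Append so that earlier summaries on disk are not overwritten.
    with memory_path.open("a", encoding="utf-8") as f:
        f.write(f"## Compaction summary\n\n{summary}\n\n")
    # The new context starts from the summary; nothing is shown to the user.
    return [
        {"role": "user", "content": f"[Prior conversation summary]\n{summary}"},
        {"role": "assistant", "content": "Understood."},
    ]

def fake_summarize(messages):
    # Stand-in for the LLM call so the sketch runs offline.
    return f"{len(messages)} earlier messages; user wants a weekly calendar digest."

history = [{"role": "user", "content": f"request {i}"} for i in range(40)]

with tempfile.TemporaryDirectory() as tmp:
    memory_file = Path(tmp) / "memory" / "MEMORY.md"
    new_context = silent_compaction_turn(history, fake_summarize, memory_file)
    saved = memory_file.read_text(encoding="utf-8")

print(len(new_context))
print(saved)
```

Writing the summary to disk before resetting the context is what lets the memory outlive the process: on restart, an agent built this way could reload the file and seed a fresh conversation from it.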
| | Sliding Window | Selective Preservation | Compaction |
|---|---|---|---|
| Simplicity | One line of code | Moderate — needs classification | Complex — extra API call |
| Information loss | Drops everything old | Keeps tagged items | LLM decides what matters |
| Cost | Free | Free (rule-based) or cheap (LLM) | Costs a summarization call |
| Best for | Short/stateless tasks | Multi-step with key milestones | Long-running complex tasks |
In practice, production agents rarely use a single strategy in isolation. A common pattern:
The sliding window is most useful when the agent's task structure guarantees that older messages are genuinely irrelevant — stateless Q&A, simple chatbots, scenarios where each turn is largely independent.
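One possible way to blend the three strategies is to try the cheaper ones first and compact only as a last resort. The sketch below is an assumption about how such a blend could look, not a pattern taken from any specific agent: `estimate_tokens` uses a rough four-characters-per-token heuristic, `manage_context` and its parameters are hypothetical names, and `compact` is an injected callable standing in for an LLM summarization call. The 0.8 default mirrors the ~80% threshold from Lecture 4.1.

```python
def estimate_tokens(messages):
    # Assumption: ~4 characters per token; a real agent would use the model's tokenizer.
    return sum(len(m["content"]) for m in messages) // 4

def manage_context(messages, system, context_window, compact, keep_last=10, threshold=0.8):
    """Apply cheaper strategies first; compact only if the history is still too large."""
    budget = int(context_window * threshold)
    if estimate_tokens([system] + messages) <= budget:
        return [system] + messages              # under threshold: nothing to do
    # Recent window (sliding-window component) plus flagged older messages.
    split = max(len(messages) - keep_last, 0)
    older, recent = messages[:split], messages[split:]
    trimmed = [m for m in older if m.get("important")] + recent
    if estimate_tokens([system] + trimmed) <= budget:
        return [system] + trimmed               # selective preservation sufficed
    return [system] + compact(trimmed)          # last resort: summarize

def fake_compact(messages):
    # Stand-in for an LLM summarization call.
    return [{"role": "user", "content": f"[Prior conversation summary] {len(messages)} messages"}]

system = {"role": "system", "content": "You are an agent."}
history = [{"role": "user", "content": "x" * 400} for _ in range(30)]
history[0]["important"] = True

roomy = manage_context(history, system, context_window=2000, compact=fake_compact)
tight = manage_context(history, system, context_window=1000, compact=fake_compact)
print(len(roomy), len(tight))
```

With the larger window, trimming to the flagged message plus the last ten is enough and no API call is made; with the smaller window, even the trimmed history exceeds the budget and the summarizer runs. Ordering the strategies by cost keeps the summarization call for the cases that need it.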
The right strategy depends on what the agent is doing. A general-purpose agent like OpenClaw that manages your calendar, email, and files over weeks demands different context management than a single-session coding assistant. Matching the strategy to the task structure is the core engineering judgment.