Lab 5: Context Management Strategies

Section 4 Lab | Agent Engineering Duration: ~3 hours Prerequisites: Lab 4 + Module 7 (Lectures 7.1–7.3)

Overview

In Module 7 you learned three strategies for managing the agent's context window:

Sliding window — keep the most recent N messages, drop older ones
Selective deletion — drop old expirable messages while preserving the system prompt and tagged-as-important messages
Compaction — at a threshold, summarize the older portion of the conversation and replace the originals with the summary

Each strategy makes a different trade between coherence (does the agent remember what matters?) and efficiency (how many tokens does it use?). On a short task none of this matters. On a long task, the choice determines whether the agent loses the thread.

This lab is a controlled experiment to expose those trade-offs. You will:

Implement all three strategies in your coding agent.
Replay each strategy against three pre-built long conversations.
Run a designed set of follow-on prompts against each (strategy, conversation) pair.
Score the results and analyze which strategy wins where, and why.

The conversations and follow-ons are provided. They are engineered to make each strategy fail in characteristic ways. The goal of the lab is not to "get it working" — the goal is to see what each strategy preserves, what it loses, and what that tells you about when to reach for each one.

What's Provided

In lab05/ you will find:

conversations/ — three pre-built long conversations as JSON message arrays:
- conv_a_debug.json — a multi-file debugging session (~60 messages)
- conv_b_features.json — a multi-feature build (~80 messages)
- conv_c_refactor.json — a constrained refactor (~70 messages)
tags/ — selective-preservation tag annotations for each conversation (one file per conversation, marking each message as preserve or expirable)
followons/ — three designed follow-on prompts per conversation (9 total)
planted_facts.md — documentation of what each follow-on is designed to test, including the specific fact, its position in the conversation, and the strategies expected to pass or fail
harness.py — a test harness that loads a conversation, applies a strategy, appends a follow-on, runs your agent, and records token counts, peak context size, and the agent's response

You do not generate the conversations or tags yourself. Using identical inputs across all students makes results comparable and turns the lab into a real experiment rather than a demo.

What You'll Produce

Three strategy implementations in your agent.py:
- sliding_window(messages, n)
- selective_delete(messages, tags, preserve_recent)
- compact(messages, llm_client, threshold_tokens)
A measurement matrix — 36 runs (4 conditions × 3 conversations × 3 follow-ons), recorded in results.csv with token counts and your manual scores
An analysis writeup (analysis.md, 2–3 pages) — which strategy won where, why, and when you would use each in a real agent

Part 1: Implement the Three Strategies (~80 minutes)

1.1 Sliding Window (~15 min)

Keep the most recent N messages, always preserving the system prompt at index 0.

Signature: sliding_window(messages: list, n: int = 20) -> list

Requirements:

The system message (index 0) is always preserved.
The most recent n messages are preserved.
All other messages are dropped.
If the conversation is shorter than n + 1 messages, return it unchanged.

1.2 Selective Deletion (~25 min)

Drop expirable old messages while preserving the system prompt, tagged-as-important messages, and recent messages.

Signature: selective_delete(messages: list, tags: list, preserve_recent: int = 10) -> list

Where tags is a parallel list, one entry per message, with values "preserve" or "expirable".

Requirements:

The system message is always preserved.
Any message tagged "preserve" is kept regardless of position.
The most recent preserve_recent messages are kept regardless of tag.
All other messages are dropped.
The relative order of surviving messages is maintained.

About the tags. For this lab, the tag files are provided so the lab stays focused on strategy comparison rather than tagging policy. In a real agent, you would need to generate tags programmatically. The provided tags follow a simple policy you could implement yourself:

The system message is always preserve.
User messages are preserve — users state goals, ask questions, and give feedback. These tend to matter later in a session.
Assistant messages without a tool call are preserve — these are the model's reasoning, decisions, and explanations. Losing them loses the agent's logic chain.
Assistant messages with tool calls are expirable — the tool call itself is usually re-derivable from the result and the surrounding context.
Tool result messages are expirable by default, but can be promoted to preserve if they contain explicit decision-relevant content (a search hit the agent then acted on, an error that drove a pivot, a constraint discovered from the environment). The provided tags promote roughly 10% of tool results under this rule.

The policy above could be applied automatically at message-append time: the role and a check for tool_use blocks tells you almost everything; the only piece requiring a real heuristic is the 10% promotion of decision-relevant tool results, which a small auxiliary LLM call could classify. As you work through Part 2 you will see where this provided policy preserves the right things and where it doesn't — useful intuition for designing your own policy in later projects.

1.3 Compaction (~40 min)

When context exceeds a threshold, summarize the older portion of the conversation into a single message and replace the originals with the summary.

Signature: compact(messages: list, llm_client, threshold_tokens: int = 8000, recent_keep: int = 10) -> list

Requirements:

Count tokens in messages. If under threshold_tokens, return unchanged.
Otherwise: keep the system message, keep the last recent_keep messages, summarize everything in between via an LLM call, and return: [system, summary_message, ...recent]
The summary prompt must explicitly extract:
- Current goal(s)
- Decisions made and the reasoning behind them
- Files modified or under modification
- Unresolved issues and the next intended action
The summary message should be inserted as a user role message with content of the form: "Summary of prior conversation:\n\n{summary}"

Before moving on, run compact() on one of the provided conversations manually and read the summary it generates. If the summary is missing decisions, drops files, or hallucinates content, revise your summary prompt. A bad summary will silently destroy the rest of your experiment.

Part 2: Run the Experiment Matrix (~50 minutes)

Use the provided harness.py. For each (strategy, conversation, follow-on) combination, the harness will:

Load the conversation message array.
Apply your strategy (or none for baseline) to produce a managed message array.
Append the follow-on prompt.
Run your agent for one turn (or until end_turn).
Record: total input tokens, output tokens, peak context size, and the agent's full response.

Required runs (36 total):

Condition	Conversations	Follow-ons	Runs
Baseline (no management)	3	3	9
Sliding window (n=20)	3	3	9
Selective deletion (preserve_recent=10)	3	3	9
Compaction (threshold=8000, recent_keep=10)	3	3	9

The harness writes results to results.csv with columns: strategy, conversation, followon_id, input_tokens, output_tokens, peak_context, response_text, response_path

Note: the baseline runs with no management may exceed the model's context window on the longer conversations. If a baseline run errors out, record that as a result — context exhaustion is itself a finding.

Part 3: Score the Follow-Ons (~30 minutes)

Each follow-on is designed to test whether the agent recalled or used a specific planted fact, documented in planted_facts.md. For each row in results.csv, score the agent's response on three criteria:

Criterion	Score	Definition
Fact recall	0 or 1	Did the agent reference the planted fact correctly?
Coherence	0, 1, or 2	Does the response read as a natural continuation? (0 = confused or contradictory, 1 = adequate, 2 = clean and contextually appropriate)
Hedging	0 or 1	When the agent did not have the fact, did it appropriately surface uncertainty or admit missing context? (1 = yes; 0 = either confidently fabricated or refused unhelpfully)

Add three columns (fact_recall, coherence, hedging) to results.csv and fill in your scores.

You score these manually. There is no auto-grader. The auto-grading approach (regex/keyword matching) is brittle — the agent will paraphrase, use synonyms, or reference the fact obliquely. Manual scoring is also itself part of the lesson: you will read what the agent actually produced under each strategy, which is the closest thing to seeing the strategy's effect with your own eyes.

A short note in analysis.md later about disagreements you had with yourself (cases where coherence felt like a 1.5) is worth more than a forced binary.

Part 4: Analyze (~40 minutes)

Write analysis.md (2–3 pages) covering the following:

1. Comparison Table

Aggregate across follow-ons to produce a summary table per (strategy, conversation):

Mean fact recall (out of 1.0)
Mean coherence (out of 2.0)
Mean input tokens used

2. Per-Strategy Verdict

For each of the four conditions (baseline, sliding window, selective deletion, compaction), one short paragraph answering:

What did this strategy preserve well?
What did it lose?
Where did its tokens go?

3. The Interesting Cases

Identify at least three specific (strategy, conversation, follow-on) cells that surprised you — places where a strategy did better or worse than you expected. For each:

What did the agent actually produce?
Why did this strategy fail (or succeed) here? Trace the reason back to the mechanism — what did the strategy drop, summarize, or preserve that explains the outcome?
What does that tell you about when this strategy is brittle?

This section is the analytical core of the lab. The token numbers will roughly match expectations. The interesting findings are in the disagreements between expectation and observation.

4. When Would You Use Each?

Conclude with a one-paragraph recommendation per strategy: in what kind of agent task would you reach for this strategy first, and why? Be specific — "long debugging sessions where the original symptom matters" beats "long tasks."

Deliverable

lab-05/
├── agent.py                  # extended with sliding_window, selective_delete, compact
├── results.csv               # 36 rows with token counts and your three score columns
├── analysis.md               # 2–3 pages
└── compaction_summaries/     # the actual summaries your compact() generated, one per conversation

Looking Ahead

The data you produce in this lab is the empirical foundation for Assignment 2: Coding Agent with Token Analysis, due at the end of Module 7. The assignment will extend this analysis with cost projections at scale, examination of edge cases your matrix didn't cover, and a design proposal for a hybrid strategy that combines the three approaches. Keep your results.csv, your compaction summaries, and the raw response transcripts — you will reference them.

The broader takeaway of this lab is the one Module 7 has been building toward: there is no single best context management strategy. Each one preserves and loses different things. A production agent typically combines all three, applying the cheapest strategy first and escalating only when needed. Lab 5 gives you the empirical grounding to make those combination decisions on real systems later.