Lab 5: Context Management Strategies

Section 4 Lab | Agent Engineering Duration: ~3 hours Prerequisites: Lab 4 + Module 7 (Lectures 7.1–7.3)


Overview

In Module 7 you learned three strategies for managing the agent's context window:

Each strategy makes a different trade between coherence (does the agent remember what matters?) and efficiency (how many tokens does it use?). On a short task none of this matters. On a long task, the choice determines whether the agent loses the thread.

This lab is a controlled experiment to expose those trade-offs. You will:

  1. Implement all three strategies in your coding agent.
  2. Replay each strategy against three pre-built long conversations.
  3. Run a designed set of follow-on prompts against each (strategy, conversation) pair.
  4. Score the results and analyze which strategy wins where, and why.

The conversations and follow-ons are provided. They are engineered to make each strategy fail in characteristic ways. The goal of the lab is not to "get it working" — the goal is to see what each strategy preserves, what it loses, and what that tells you about when to reach for each one.


What's Provided

In lab05/ you will find:

You do not generate the conversations or tags yourself. Using identical inputs across all students makes results comparable and turns the lab into a real experiment rather than a demo.


What You'll Produce

  1. Three strategy implementations in your agent.py:
    • sliding_window(messages, n)
    • selective_delete(messages, tags, preserve_recent)
    • compact(messages, llm_client, threshold_tokens)
  2. A measurement matrix — 36 runs (4 conditions × 3 conversations × 3 follow-ons), recorded in results.csv with token counts and your manual scores
  3. An analysis writeup (analysis.md, 2–3 pages) — which strategy won where, why, and when you would use each in a real agent

Part 1: Implement the Three Strategies (~80 minutes)

1.1 Sliding Window (~15 min)

Keep the most recent N messages, always preserving the system prompt at index 0.

Signature: sliding_window(messages: list, n: int = 20) -> list

Requirements:

1.2 Selective Deletion (~25 min)

Drop expirable old messages while preserving the system prompt, tagged-as-important messages, and recent messages.

Signature: selective_delete(messages: list, tags: list, preserve_recent: int = 10) -> list

Where tags is a parallel list, one entry per message, with values "preserve" or "expirable".

Requirements:

About the tags. For this lab, the tag files are provided so the lab stays focused on strategy comparison rather than tagging policy. In a real agent, you would need to generate tags programmatically. The provided tags follow a simple policy you could implement yourself:

The policy above could be applied automatically at message-append time: the role and a check for tool_use blocks tells you almost everything; the only piece requiring a real heuristic is the 10% promotion of decision-relevant tool results, which a small auxiliary LLM call could classify. As you work through Part 2 you will see where this provided policy preserves the right things and where it doesn't — useful intuition for designing your own policy in later projects.

1.3 Compaction (~40 min)

When context exceeds a threshold, summarize the older portion of the conversation into a single message and replace the originals with the summary.

Signature: compact(messages: list, llm_client, threshold_tokens: int = 8000, recent_keep: int = 10) -> list

Requirements:

Before moving on, run compact() on one of the provided conversations manually and read the summary it generates. If the summary is missing decisions, drops files, or hallucinates content, revise your summary prompt. A bad summary will silently destroy the rest of your experiment.


Part 2: Run the Experiment Matrix (~50 minutes)

Use the provided harness.py. For each (strategy, conversation, follow-on) combination, the harness will:

  1. Load the conversation message array.
  2. Apply your strategy (or none for baseline) to produce a managed message array.
  3. Append the follow-on prompt.
  4. Run your agent for one turn (or until end_turn).
  5. Record: total input tokens, output tokens, peak context size, and the agent's full response.

Required runs (36 total):

Condition Conversations Follow-ons Runs
Baseline (no management) 3 3 9
Sliding window (n=20) 3 3 9
Selective deletion (preserve_recent=10) 3 3 9
Compaction (threshold=8000, recent_keep=10) 3 3 9

The harness writes results to results.csv with columns: strategy, conversation, followon_id, input_tokens, output_tokens, peak_context, response_text, response_path

Note: the baseline runs with no management may exceed the model's context window on the longer conversations. If a baseline run errors out, record that as a result — context exhaustion is itself a finding.


Part 3: Score the Follow-Ons (~30 minutes)

Each follow-on is designed to test whether the agent recalled or used a specific planted fact, documented in planted_facts.md. For each row in results.csv, score the agent's response on three criteria:

Criterion Score Definition
Fact recall 0 or 1 Did the agent reference the planted fact correctly?
Coherence 0, 1, or 2 Does the response read as a natural continuation? (0 = confused or contradictory, 1 = adequate, 2 = clean and contextually appropriate)
Hedging 0 or 1 When the agent did not have the fact, did it appropriately surface uncertainty or admit missing context? (1 = yes; 0 = either confidently fabricated or refused unhelpfully)

Add three columns (fact_recall, coherence, hedging) to results.csv and fill in your scores.

You score these manually. There is no auto-grader. The auto-grading approach (regex/keyword matching) is brittle — the agent will paraphrase, use synonyms, or reference the fact obliquely. Manual scoring is also itself part of the lesson: you will read what the agent actually produced under each strategy, which is the closest thing to seeing the strategy's effect with your own eyes.

A short note in analysis.md later about disagreements you had with yourself (cases where coherence felt like a 1.5) is worth more than a forced binary.


Part 4: Analyze (~40 minutes)

Write analysis.md (2–3 pages) covering the following:

1. Comparison Table

Aggregate across follow-ons to produce a summary table per (strategy, conversation):

2. Per-Strategy Verdict

For each of the four conditions (baseline, sliding window, selective deletion, compaction), one short paragraph answering:

3. The Interesting Cases

Identify at least three specific (strategy, conversation, follow-on) cells that surprised you — places where a strategy did better or worse than you expected. For each:

This section is the analytical core of the lab. The token numbers will roughly match expectations. The interesting findings are in the disagreements between expectation and observation.

4. When Would You Use Each?

Conclude with a one-paragraph recommendation per strategy: in what kind of agent task would you reach for this strategy first, and why? Be specific — "long debugging sessions where the original symptom matters" beats "long tasks."


Deliverable

lab-05/
├── agent.py                  # extended with sliding_window, selective_delete, compact
├── results.csv               # 36 rows with token counts and your three score columns
├── analysis.md               # 2–3 pages
└── compaction_summaries/     # the actual summaries your compact() generated, one per conversation

Looking Ahead

The data you produce in this lab is the empirical foundation for Assignment 2: Coding Agent with Token Analysis, due at the end of Module 7. The assignment will extend this analysis with cost projections at scale, examination of edge cases your matrix didn't cover, and a design proposal for a hybrid strategy that combines the three approaches. Keep your results.csv, your compaction summaries, and the raw response transcripts — you will reference them.

The broader takeaway of this lab is the one Module 7 has been building toward: there is no single best context management strategy. Each one preserves and loses different things. A production agent typically combines all three, applying the cheapest strategy first and escalating only when needed. Lab 5 gives you the empirical grounding to make those combination decisions on real systems later.