Section 4 Lab | Agent Engineering Duration: ~3 hours Prerequisites: Lab 4 + Module 7 (Lectures 7.1–7.3)
In Module 7 you learned three strategies for managing the agent's context window:
Each strategy makes a different trade between coherence (does the agent remember what matters?) and efficiency (how many tokens does it use?). On a short task none of this matters. On a long task, the choice determines whether the agent loses the thread.
This lab is a controlled experiment to expose those trade-offs. You will:
The conversations and follow-ons are provided. They are engineered to make each strategy fail in characteristic ways. The goal of the lab is not to "get it working" — the goal is to see what each strategy preserves, what it loses, and what that tells you about when to reach for each one.
In lab05/ you will find:
conversations/ — three pre-built long conversations as JSON message arrays:conv_a_debug.json — a multi-file debugging session (~60 messages)conv_b_features.json — a multi-feature build (~80 messages)conv_c_refactor.json — a constrained refactor (~70 messages)tags/ — selective-preservation tag annotations for each conversation (one file per conversation, marking each message as preserve or expirable)followons/ — three designed follow-on prompts per conversation (9 total)planted_facts.md — documentation of what each follow-on is designed to test, including the specific fact, its position in the conversation, and the strategies expected to pass or failharness.py — a test harness that loads a conversation, applies a strategy, appends a follow-on, runs your agent, and records token counts, peak context size, and the agent's responseYou do not generate the conversations or tags yourself. Using identical inputs across all students makes results comparable and turns the lab into a real experiment rather than a demo.
agent.py:sliding_window(messages, n)selective_delete(messages, tags, preserve_recent)compact(messages, llm_client, threshold_tokens)results.csv with token counts and your manual scoresanalysis.md, 2–3 pages) — which strategy won where, why, and when you would use each in a real agentKeep the most recent N messages, always preserving the system prompt at index 0.
Signature: sliding_window(messages: list, n: int = 20) -> list
Requirements:
n messages are preserved.n + 1 messages, return it unchanged.Drop expirable old messages while preserving the system prompt, tagged-as-important messages, and recent messages.
Signature: selective_delete(messages: list, tags: list, preserve_recent: int = 10) -> list
Where tags is a parallel list, one entry per message, with values "preserve" or "expirable".
Requirements:
"preserve" is kept regardless of position.preserve_recent messages are kept regardless of tag.About the tags. For this lab, the tag files are provided so the lab stays focused on strategy comparison rather than tagging policy. In a real agent, you would need to generate tags programmatically. The provided tags follow a simple policy you could implement yourself:
preserve.preserve — users state goals, ask questions, and give feedback. These tend to matter later in a session.preserve — these are the model's reasoning, decisions, and explanations. Losing them loses the agent's logic chain.expirable — the tool call itself is usually re-derivable from the result and the surrounding context.expirable by default, but can be promoted to preserve if they contain explicit decision-relevant content (a search hit the agent then acted on, an error that drove a pivot, a constraint discovered from the environment). The provided tags promote roughly 10% of tool results under this rule.The policy above could be applied automatically at message-append time: the role and a check for tool_use blocks tells you almost everything; the only piece requiring a real heuristic is the 10% promotion of decision-relevant tool results, which a small auxiliary LLM call could classify. As you work through Part 2 you will see where this provided policy preserves the right things and where it doesn't — useful intuition for designing your own policy in later projects.
When context exceeds a threshold, summarize the older portion of the conversation into a single message and replace the originals with the summary.
Signature: compact(messages: list, llm_client, threshold_tokens: int = 8000, recent_keep: int = 10) -> list
Requirements:
messages. If under threshold_tokens, return unchanged.recent_keep messages, summarize everything in between via an LLM call, and return:
[system, summary_message, ...recent]user role message with content of the form:
"Summary of prior conversation:\n\n{summary}"Before moving on, run compact() on one of the provided conversations manually and read the summary it generates. If the summary is missing decisions, drops files, or hallucinates content, revise your summary prompt. A bad summary will silently destroy the rest of your experiment.
Use the provided harness.py. For each (strategy, conversation, follow-on) combination, the harness will:
none for baseline) to produce a managed message array.end_turn).Required runs (36 total):
| Condition | Conversations | Follow-ons | Runs |
|---|---|---|---|
| Baseline (no management) | 3 | 3 | 9 |
| Sliding window (n=20) | 3 | 3 | 9 |
| Selective deletion (preserve_recent=10) | 3 | 3 | 9 |
| Compaction (threshold=8000, recent_keep=10) | 3 | 3 | 9 |
The harness writes results to results.csv with columns:
strategy, conversation, followon_id, input_tokens, output_tokens, peak_context, response_text, response_path
Note: the baseline runs with no management may exceed the model's context window on the longer conversations. If a baseline run errors out, record that as a result — context exhaustion is itself a finding.
Each follow-on is designed to test whether the agent recalled or used a specific planted fact, documented in planted_facts.md. For each row in results.csv, score the agent's response on three criteria:
| Criterion | Score | Definition |
|---|---|---|
| Fact recall | 0 or 1 | Did the agent reference the planted fact correctly? |
| Coherence | 0, 1, or 2 | Does the response read as a natural continuation? (0 = confused or contradictory, 1 = adequate, 2 = clean and contextually appropriate) |
| Hedging | 0 or 1 | When the agent did not have the fact, did it appropriately surface uncertainty or admit missing context? (1 = yes; 0 = either confidently fabricated or refused unhelpfully) |
Add three columns (fact_recall, coherence, hedging) to results.csv and fill in your scores.
You score these manually. There is no auto-grader. The auto-grading approach (regex/keyword matching) is brittle — the agent will paraphrase, use synonyms, or reference the fact obliquely. Manual scoring is also itself part of the lesson: you will read what the agent actually produced under each strategy, which is the closest thing to seeing the strategy's effect with your own eyes.
A short note in analysis.md later about disagreements you had with yourself (cases where coherence felt like a 1.5) is worth more than a forced binary.
Write analysis.md (2–3 pages) covering the following:
Aggregate across follow-ons to produce a summary table per (strategy, conversation):
For each of the four conditions (baseline, sliding window, selective deletion, compaction), one short paragraph answering:
Identify at least three specific (strategy, conversation, follow-on) cells that surprised you — places where a strategy did better or worse than you expected. For each:
This section is the analytical core of the lab. The token numbers will roughly match expectations. The interesting findings are in the disagreements between expectation and observation.
Conclude with a one-paragraph recommendation per strategy: in what kind of agent task would you reach for this strategy first, and why? Be specific — "long debugging sessions where the original symptom matters" beats "long tasks."
lab-05/
├── agent.py # extended with sliding_window, selective_delete, compact
├── results.csv # 36 rows with token counts and your three score columns
├── analysis.md # 2–3 pages
└── compaction_summaries/ # the actual summaries your compact() generated, one per conversation
The data you produce in this lab is the empirical foundation for Assignment 2: Coding Agent with Token Analysis, due at the end of Module 7. The assignment will extend this analysis with cost projections at scale, examination of edge cases your matrix didn't cover, and a design proposal for a hybrid strategy that combines the three approaches. Keep your results.csv, your compaction summaries, and the raw response transcripts — you will reference them.
The broader takeaway of this lab is the one Module 7 has been building toward: there is no single best context management strategy. Each one preserves and loses different things. A production agent typically combines all three, applying the cheapest strategy first and escalating only when needed. Lab 5 gives you the empirical grounding to make those combination decisions on real systems later.