Section 2 Lab | Agent Engineering
Duration: ~2-3 hours
Prerequisites: Lab 1 + Modules 3-4 (Lectures 3.1 through 4.3)
In Lab 1, you set up your environment and built a chat loop. In this lab, you'll use those skills to run experiments — systematically testing how LLMs behave under different conditions and observing context dynamics firsthand.
This is an observation lab. You'll write code, but the deliverable is a lab report documenting what you found. The goal is to build calibrated intuition about what LLMs can do, where they break, and how context affects quality.
Model choice: You may use any frontier model from Anthropic, OpenAI, or Google for this lab. We recommend Claude Haiku or Sonnet for cost efficiency — you'll be making many API calls. Whichever model(s) you use, document them clearly in your report. If you compare models, that's worth noting too.
A lab report (Markdown file) documenting:
Important: Your report must include the actual prompts you used and the specific results you observed. Generic summaries like "quality decreased" aren't sufficient — show the data.
Write a script that sends the same prompt to the API multiple times at different temperature values: 0.0, 0.3, 0.7, and 1.0. Run each temperature 3 times (12 calls total).
Use a prompt that's sensitive to variation — something creative where there's no single correct answer. For example, you might ask the model to name a startup or describe an imaginary animal. Do not use these examples — come up with your own.
Record in your report:
Repeat the experiment with a coding prompt — ask the model to write a specific function or solve a programming problem.
Record in your report:
Hold temperature at 1.0 and vary top-p: 0.1, 0.5, 0.9. Run each 3 times with the same creative prompt from 1.1.
Record in your report:
Send a prompt that asks the model to produce a numbered list (at least 10 items). Add a stop sequence that cuts it off mid-list — for example, stopping at item 5.
Record in your report:
Cost tip: Use Haiku for this section — you're making many calls for comparison, and model quality isn't the point. Switch to Sonnet or a frontier model for Parts 2 and 3 where quality matters.
Test the model on tasks that stress different abilities. For each category below, design your own test prompts and document the results.
Reasoning: Give the model a multi-step logic puzzle or deduction problem. Example direction: "something that requires 3+ steps of inference." Try making it progressively harder. Where does it break?
Arithmetic: Test with small numbers, then increase. At what scale does the model start making errors? Try both multiplication and division.
Code generation: Ask for a well-known algorithm, then something more obscure or domain-specific. Compare the quality.
Knowledge boundaries: Ask about events or facts from different time periods. Can you find the model's knowledge cutoff? Ask about something obscure but verifiable — does the model admit uncertainty or fabricate?
Deliberately try to make the model fail. These failure modes trace directly back to training (Lecture 2.2 — RLHF side effects, internet training data patterns).
Hallucination: Ask about something that sounds plausible but doesn't exist — a fake research paper, a made-up Python library, a fictional historical event. Does the model generate confident-sounding nonsense, or does it flag uncertainty? Example direction: "ask about a specific paper by a real author on a topic they didn't write about." Do not use this example — design your own.
Sycophancy: State something factually wrong with confidence and see if the model agrees or pushes back. Try varying how assertively you state the wrong thing.
Conflicting instructions: Put one instruction in the system prompt and a contradicting instruction in the user message. Which one wins? Try several variations.
Record in your report:
This is the core of the lab. You'll observe the concepts from Module 4 firsthand.
Start from your Lab 1 chat loop (or write a new one). Add cumulative token tracking — after each turn, print:
Have a 20+ turn conversation on a consistent topic. Record the input token count at turns 1, 5, 10, 15, and 20.
Record in your report:
This experiment tests whether the model can retrieve specific facts as context grows.
Setup: In your first message, tell the model 5 specific facts. Make them distinct and unambiguous — a name, a number, a color, a city, and a date. For example: "My cat's name is Marble, I own 7 guitars, my favorite color is teal, I grew up in Lisbon, and I graduated on June 3rd, 2019." Use your own facts, not these.
Procedure:
To save time, use the context padding script below to automate the unrelated turns.
context_padder.py:

```python
"""
Pads a conversation with N unrelated exchanges to test context recall.
Sends generic questions on random topics to grow the conversation
without introducing information related to your test facts.
"""
import random

from dotenv import load_dotenv
from anthropic import Anthropic

load_dotenv()
client = Anthropic()

PADDING_PROMPTS = [
    "What's the difference between a stack and a queue?",
    "Explain how a combustion engine works in simple terms.",
    "What are three interesting facts about octopuses?",
    "How does a binary search algorithm work?",
    "What causes thunder?",
    "Describe how a refrigerator keeps food cold.",
    "What's the tallest building in the world and how tall is it?",
    "Explain the water cycle in a few sentences.",
    "How does email actually get delivered?",
    "What's the difference between HTTP and HTTPS?",
    "How do noise-canceling headphones work?",
    "What causes the tides?",
    "How does a microwave heat food?",
    "What's the difference between RAM and a hard drive?",
    "How do airplanes stay in the air?",
    "What causes a rainbow?",
    "How does GPS know where you are?",
    "What's the difference between a virus and a bacterium?",
    "How does a touchscreen detect your finger?",
    "What causes earthquakes?",
]


def pad_conversation(messages, model, system_prompt, num_turns=10):
    """
    Add num_turns of unrelated exchanges to the conversation.
    Returns the updated messages list and token counts per turn.
    """
    prompts = random.sample(PADDING_PROMPTS, min(num_turns, len(PADDING_PROMPTS)))
    if num_turns > len(PADDING_PROMPTS):
        # Repeat if we need more turns than available prompts
        prompts = prompts * (num_turns // len(prompts) + 1)
    prompts = prompts[:num_turns]
    token_log = []
    for i, prompt in enumerate(prompts, 1):
        messages.append({"role": "user", "content": prompt})
        response = client.messages.create(
            model=model,
            max_tokens=256,
            system=system_prompt,
            messages=messages,
        )
        assistant_text = response.content[0].text
        messages.append({"role": "assistant", "content": assistant_text})
        token_log.append({
            "turn": i,
            "input_tokens": response.usage.input_tokens,
            "output_tokens": response.usage.output_tokens,
        })
        print(f"  Padding turn {i}/{num_turns} — "
              f"input: {response.usage.input_tokens}, "
              f"output: {response.usage.output_tokens}")
    return messages, token_log
```
Use this by importing pad_conversation into your experiment script. Pass in your existing messages list, and it will add unrelated exchanges while tracking tokens.
Record in your report:
Now test whether compaction (from Lecture 4.2) preserves recall while reducing token count.
Procedure:
Record in your report:
A lab report in Markdown format (lab-02-report.md) containing:
Include all prompts you used, all models you tested, and specific observations — not just conclusions.
Submit your lab-02-report.md along with any scripts you wrote.
API calls are slow with long conversations
This is expected — longer context means more processing time. If Part 3 feels slow, switch to Haiku for the padding turns. The model choice matters most for the quiz turns where you're testing recall quality.
Token counts seem wrong or missing
Make sure you're reading response.usage.input_tokens and response.usage.output_tokens. Some SDK versions may structure this differently — check the API reference for your provider if you're not using Anthropic.
Compaction summary is too short / loses important facts
Your summarization prompt matters. Be specific about what to preserve. If the summary drops your test facts, revise the prompt to explicitly instruct preservation of "all specific facts, names, numbers, and dates mentioned by the user."
Rate limit errors during padding
You're making many calls in a loop. Add a small delay (import time; time.sleep(0.5)) between calls if you hit rate limits. Haiku has higher rate limits than Sonnet/Opus.
Using a non-Anthropic model
The experiments work with any frontier model. The SDK calls will differ — consult your provider's documentation. The concepts (token counting, context growth, recall degradation) are universal.