Lab 3: The Booking Agent

Section 3 Lab | Agent Engineering
Duration: ~2.5 hours
Prerequisites: Module 5 (Lectures 5.1–5.3)

Overview

In this lab you will apply the prompting principles from Module 5 to a concrete, observable problem: a scheduling agent that manages calendars for a small team.

The agent loop and all tools are already implemented. Your only job is to write the system prompt in system_prompt.md. The quality of your prompt determines how the agent behaves: whether it checks before booking, respects team preferences, and sends properly formatted emails, or ignores most of the rules entirely.

You will start with a deliberately poor system prompt, observe its failures, write a better one, and run the same scripted test both times to compare results. You will then analyze what changed — not just in score, but in token usage — and form a view on where the sweet spot between prompt quality and efficiency sits.

What You're Given

All files are in labs/lab03/:

| File | Purpose |
| --- | --- |
| `agent.py` | Interactive agent for testing your prompt manually |
| `test_script.py` | Automated 8-turn test with scoring (11 pts total) |
| `reset.py` | Resets all calendars and logs to initial state |
| `tools.py` | Tool implementations (do not modify) |
| `system_prompt.md` | The only file you edit |
| `bad_prompt.md` | The starting bad prompt (reference copy) |
| `good_prompt.md` | A well-written reference prompt (do not open until after your own attempt) |
| `calendar_ada.md` | Ada's calendar |
| `calendar_alan.md` | Alan's calendar |
| `calendar_emmy.md` | Emmy's calendar |
| `email_log.txt` | All emails logged by the agent |
| `trace.log` | Full tool call trace |

The agent uses Claude Haiku (claude-haiku-4-5-20251001). This is intentional. A weaker model exposes prompt weaknesses much more visibly than a frontier model would — making the impact of your improvements concrete and measurable.

Setup

From labs/lab03/:

pip install -r requirements.txt

Your Anthropic API key must be in a .env file in the labs/lab03/ directory. The lab code loads it automatically.

The Scenario

You manage the schedule for a three-person team: Ada, Alan, and Emmy. Their calendars are markdown files in the lab directory. The agent handles booking, cancellation, availability checks, and email notification.

The Six Tools

The agent has six tools. Understanding what each tool does — and when the agent should use it — is essential for writing a good system prompt. A prompt that does not tell the agent when and in what order to call tools will produce inconsistent, incorrect behavior.

list_people()

Returns the list of team members and their current scheduling preferences. Takes no arguments.

This tool exists because preferences are not static — they may change. A prompt that hardcodes preference information is a prompt that becomes stale. The right design is to instruct the agent to call this tool at the start of every booking turn so preferences are always retrieved from the authoritative source.

What it returns (currently):

Team members and their scheduling preferences:
- Ada: No scheduling restrictions.
- Alan: Prefers no meetings before 10:00am.
- Emmy: Prefers group meetings only (3 or more attendees). Do not book 1:1 meetings with Emmy.
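
Because the tool returns plain text, any script you write to inspect it has to parse that text. The following is a minimal sketch, assuming the output keeps the `- Name: preference` shape shown above. `parse_preferences` is a hypothetical helper for your own debugging and is not part of `tools.py`.

```python
def parse_preferences(tool_output: str) -> dict[str, str]:
    """Map each team member to their preference text.

    Assumes every preference line looks like '- Name: preference'.
    """
    prefs = {}
    for line in tool_output.splitlines():
        line = line.strip()
        if not line.startswith("- "):
            continue  # skip the header line and any blank lines
        # Split on the first colon only, so times like 10:00am stay intact.
        name, _, preference = line[2:].partition(":")
        prefs[name.strip()] = preference.strip()
    return prefs


SAMPLE = """Team members and their scheduling preferences:
- Ada: No scheduling restrictions.
- Alan: Prefers no meetings before 10:00am.
- Emmy: Prefers group meetings only (3 or more attendees). Do not book 1:1 meetings with Emmy.
"""

prefs = parse_preferences(SAMPLE)
print(prefs["Alan"])
```

The agent itself does not need a parser: the model reads the text directly. The sketch only shows that the output is regular enough to check by script if you want to.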

read_calendar(person, date)

Returns a person's full schedule for a given date, formatted as a list of appointments with start/end times and titles. Use this when you need to understand someone's full day — for example, when searching for a free slot for a group meeting.

check_availability(person, date, time, duration_minutes)

Checks whether a specific person is free at a given time for a given duration. Returns "available" or a description of the conflicting appointment.

This tool should be called for every attendee before booking any meeting — including when the agent believes it already knows the schedule from earlier in the conversation. Memory is not a substitute for a tool call.

This same principle applies when declining a booking due to a conflict: the agent should call check_availability to confirm the conflict exists before refusing. Do not decline based on recall alone.
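
The conflict test behind a tool like this is typically an interval-overlap check. The sketch below is an illustration only: the real logic lives in `tools.py`, and `find_conflict`, the tuple layout, and the sample appointment are assumptions made for the example.

```python
from datetime import datetime, timedelta


def find_conflict(appointments, start, duration_minutes):
    """Return the title of the first appointment overlapping the request, or None.

    appointments: list of (start, end, title) tuples with datetime values.
    Two intervals overlap when each one starts before the other ends.
    """
    end = start + timedelta(minutes=duration_minutes)
    for appt_start, appt_end, title in appointments:
        if start < appt_end and appt_start < end:
            return title
    return None


# Hypothetical calendar entry, not taken from the lab's calendar files.
day = datetime(2025, 1, 6)  # a Monday
busy = [(day.replace(hour=11), day.replace(hour=12), "Design review")]

print(find_conflict(busy, day.replace(hour=11, minute=30), 30))  # overlaps the review
print(find_conflict(busy, day.replace(hour=10), 60))             # ends exactly as the review starts
```

Note that this check covers overlap only. It says nothing about the buffer rule or working hours, which is one reason the system prompt has to state those rules separately.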

book_appointment(attendees, date, time, duration_minutes, title)

Writes an appointment to all attendees' calendar files simultaneously. Should only be called after check_availability confirms every attendee is free. Must include all attendees.

send_email(to, subject, body)

Logs an email to email_log.txt. A confirmation email should be sent after every successful booking; a notification email should be sent after every cancellation.

The subject line format matters. A well-designed prompt specifies:

Booking confirmations: Confirmed: [title] on [day] at [time]
Cancellation notifications: Cancelled: [title] on [day]

For group booking emails, the email body should name all attendees.
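
To make the two subject formats concrete, here is a small sketch that builds them and applies a substring check of the kind the test's failure messages describe. The function names and the sample title are hypothetical. The agent produces subjects from your prompt, not from code like this.

```python
def confirmation_subject(title: str, day: str, time: str) -> str:
    """Confirmed: [title] on [day] at [time]"""
    return f"Confirmed: {title} on {day} at {time}"


def cancellation_subject(title: str, day: str) -> str:
    """Cancelled: [title] on [day]"""
    return f"Cancelled: {title} on {day}"


def subject_contains(subject: str, marker: str) -> bool:
    """A simple substring check, assumed to resemble what a grader might do."""
    return marker in subject


booked = confirmation_subject("1:1 with Ada", "Monday", "10:00am")
print(booked)
print(subject_contains(booked, "Confirmed:"))
print(subject_contains("Your meeting is booked", "Confirmed:"))
```

The last line shows why a plausible but unspecified subject fails: it reads fine to a person and still lacks the required marker.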

cancel_appointment(person, date, time)

Removes an existing appointment from a person's calendar by matching the start time. After cancelling, send the person a notification email.

Business Rules

In addition to correct tool usage, the agent must follow these rules:

15-minute buffer rule. Never book a meeting that starts within 15 minutes of another meeting ending for the same person. If a meeting ends at 12:00pm, the earliest the next meeting for that person can start is 12:15pm.
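
The arithmetic in that example can be written down directly. This is a minimal sketch of the rule as worded above. The helper names are hypothetical, and the date is arbitrary.

```python
from datetime import datetime, timedelta

BUFFER = timedelta(minutes=15)


def earliest_next_start(previous_end: datetime) -> datetime:
    """Earliest allowed start for the same person's next meeting."""
    return previous_end + BUFFER


def respects_buffer(previous_end: datetime, proposed_start: datetime) -> bool:
    """True if the proposed start leaves at least the full buffer."""
    return proposed_start >= previous_end + BUFFER


noon = datetime(2025, 1, 6, 12, 0)
print(earliest_next_start(noon).strftime("%H:%M"))         # 12:15
print(respects_buffer(noon, datetime(2025, 1, 6, 12, 10)))  # False
```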

Working hours. Only book meetings between 9:30am and 4:30pm.

Group meeting protocol. When no time is specified for a multi-person meeting, the agent must read all attendees' calendars, compute free windows for each person (applying the buffer rule), find the earliest common window, then check availability and book.
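
The search the protocol describes can be sketched as follows. Several things here are assumptions for illustration: times are minutes since midnight, candidates are scanned in 15-minute steps, a meeting must end by 4:30pm, the buffer is applied only after an existing meeting ends (as the rule is worded), and the sample calendars are made up. Team preferences are not modelled.

```python
WORK_START = 9 * 60 + 30  # 9:30am, in minutes since midnight
WORK_END = 16 * 60 + 30   # 4:30pm
BUFFER = 15               # minutes required after an existing meeting ends


def is_free(busy, start, duration):
    """True if [start, start + duration) avoids every busy interval and its trailing buffer."""
    end = start + duration
    return all(end <= b_start or start >= b_end + BUFFER for b_start, b_end in busy)


def earliest_common_start(calendars, duration, step=15):
    """Return the earliest start that works for every person, or None."""
    start = WORK_START
    while start + duration <= WORK_END:
        if all(is_free(busy, start, duration) for busy in calendars.values()):
            return start
        start += step
    return None


def fmt(minutes):
    return f"{minutes // 60}:{minutes % 60:02d}"


# Made-up calendars for illustration; these are not the lab's calendar files.
sample = {
    "Ada": [(10 * 60, 11 * 60)],       # 10:00-11:00
    "Alan": [(9 * 60 + 30, 10 * 60)],  # 9:30-10:00
    "Emmy": [(13 * 60, 14 * 60)],      # 13:00-14:00
}
slot = earliest_common_start(sample, 60)
print(fmt(slot) if slot is not None else "no common slot")  # 11:15
```

In the sample, 11:00 is rejected because it falls inside Ada's buffer, so the earliest common hour-long slot is 11:15. The agent has to do this reasoning in its head, which is why the prompt should spell the steps out.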

The Test Script

test_script.py runs an 8-turn scripted conversation against your system prompt and scores each turn. It resets all state automatically at startup — you do not need to run reset.py before running the test.

Scoring: 11 points total

| Turn | User Request | What Is Tested | Points |
| --- | --- | --- | --- |
| 1 | Book a 1:1 with Ada on Monday at 11am | Check before booking; detect conflict; do NOT book | 2 |
| 2 | OK, book it at 10am instead | Check; book; send email with `Confirmed:` in subject | 1 |
| 3 | Schedule an hour-long team meeting with Ada, Alan, Emmy on Monday | Read all 3 calendars; respect buffer rule; book all 3; email body names all attendees | 2 |
| 4 | Book a 15-minute catch-up with just Emmy on Monday at 2pm | Call `list_people`; recognize Emmy's preference; refuse the 1:1 | 2 |
| 5 | Book a catch-up with Alan at the same time as the team meeting | Call `check_availability`; detect conflict; do NOT book | 1 |
| 6 | Ada's weekly sync has been cancelled — remove it and notify her | Cancel the appointment; send email with `Cancelled:` in subject | 1 |
| 7 | Find a 30-minute slot for Ada and Alan before the team meeting and book it | Find valid slot respecting buffer; book both; email both | 1 |
| 8 | Send the agenda to all three team meeting attendees | Email Ada, Alan, AND Emmy | 1 |
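
If you record your results in a script, the per-turn maximums from the table above can serve as a sanity check. This is a minimal sketch, assuming whole-number points per turn. `total_score` is a hypothetical helper and is not part of `test_script.py`.

```python
# Maximum points per turn, copied from the scoring table.
MAX_POINTS = {1: 2, 2: 1, 3: 2, 4: 2, 5: 1, 6: 1, 7: 1, 8: 1}


def total_score(earned):
    """Sum per-turn points, rejecting values outside 0..max for that turn."""
    for turn, points in earned.items():
        if not 0 <= points <= MAX_POINTS[turn]:
            raise ValueError(f"Turn {turn}: {points} is outside 0..{MAX_POINTS[turn]}")
    return sum(earned.values())


print(sum(MAX_POINTS.values()))  # 11
```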

Each turn, the script prints:

The tool calls made (in order)
The agent's response (truncated to 300 chars)
Token counts for that turn
A per-check PASS / PARTIAL / FAIL verdict
Points earned

At the end, you get a final scorecard with a visual bar for each turn and total tokens used.

Reading failures. The verdict lines tell you exactly what went wrong. A [FAIL] check_availability NOT called means the agent booked (or refused) without verifying. A [FAIL] Email subject does NOT contain 'Confirmed:' means the email format rule was not specified clearly enough. Use these messages as a checklist for what to add to your prompt.
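
A check like "check_availability NOT called" comes down to inspecting the ordered list of tool calls for a turn. The sketch below shows one way such a check could work. It is an assumption about the approach, not the code in `test_script.py`, and the `(tool_name, args)` trace layout is hypothetical.

```python
def checked_before_booking(tool_calls, attendees):
    """True if every attendee was checked before the first book_appointment call.

    tool_calls: ordered list of (tool_name, args_dict) pairs for one turn.
    """
    checked = set()
    for name, args in tool_calls:
        if name == "check_availability":
            checked.add(args["person"])
        elif name == "book_appointment":
            return set(attendees) <= checked
    return True  # no booking happened, so nothing was booked unchecked


good_trace = [
    ("list_people", {}),
    ("check_availability", {"person": "Ada"}),
    ("book_appointment", {"attendees": ["Ada"]}),
]
bad_trace = [("book_appointment", {"attendees": ["Ada"]})]

print(checked_before_booking(good_trace, ["Ada"]))  # True
print(checked_before_booking(bad_trace, ["Ada"]))   # False
```

Reading trace.log with this ordering in mind makes it easier to tell a missing instruction from a misordered one.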

Activity 1: Diagnose the Bad Prompt (20 min)

system_prompt.md starts with the bad prompt. Run the test:

cd labs/lab03
python test_script.py

Watch each turn carefully. For each failure or partial, record:

What the agent did wrong
Which anti-pattern from the lectures explains it — vague role, missing constraints, no tool sequencing, no output format requirements, etc.
What specific instruction or rule would have fixed it

Some failures are obvious (the agent books without checking). Others are subtle (the email subject doesn't follow any format, so the subject check fails). The bad prompt produces both kinds.

Tip: Also look at which turns pass despite the bad prompt. The model's general intelligence fills in some gaps on simple tasks. Your goal in Activity 2 is to close the gap on the harder ones — particularly Emmy's preference and the email format rules, which require explicit instruction to pass.

Activity 2: Write an Improved System Prompt (60 min)

Edit system_prompt.md. Apply the system prompt architecture from Lecture 5.2:

1. Identity and role. Who is this agent? What is its purpose? What does it manage?

2. Tool reference. For each tool: what it does, when to call it, and any constraints on sequencing. The tool descriptions in the agent schema already provide basic information — your system prompt is where you add usage rules. For example: "Call check_availability for every attendee before booking — even if you recall the schedule from earlier in this conversation."

3. Business rules. Each rule as a named, explicit block. Don't bury the 15-minute buffer rule in a sentence — give it a heading and state it precisely. Same for working hours and group meeting protocol.

4. Team preferences. This is the part most students underestimate. Emmy does not accept 1:1 meetings. Alan does not take early-morning meetings. These facts are not obvious to the model — they must be stated explicitly. The list_people tool returns current preferences, but your prompt needs to instruct the agent to call that tool and what to do with the result. Telling the model to call list_people and then also stating the current rules is not redundant — the tool call is what ensures the agent always works from current data, not stale prompt text.

5. Output format. Email subject line conventions. Response format after booking, after cancellation, after a conflict. If you don't specify these, the model will invent a format and the email content checks will fail.

6. Protocol. A step-by-step sequence for each action type. Booking: (1) call list_people, (2) call check_availability for every attendee, (3) if available, call book_appointment, (4) call send_email with proper subject. Group scheduling: read all calendars first, find free windows, then run the booking protocol. Cancellation: cancel, then email.
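
The booking sequence in item 6 can be made concrete with a short sketch. This is not how the agent is implemented: the model chooses its own tool calls, and your prompt is what steers it toward this order. The fake tools, the `tools` dictionary, and sending one email per attendee are assumptions for the example.

```python
calls = []  # records the order in which the fake tools are called


def make_fake_tools(conflicts):
    """Build stand-in tools; conflicts maps a person to a conflict description."""
    def record(name, result):
        def tool(*args):
            calls.append(name)
            return result(*args) if callable(result) else result
        return tool

    return {
        "list_people": record("list_people", "preferences text"),
        "check_availability": record(
            "check_availability",
            lambda person, *_: conflicts.get(person, "available"),
        ),
        "book_appointment": record("book_appointment", "ok"),
        "send_email": record("send_email", "ok"),
    }


def run_booking_protocol(tools, attendees, day, time, duration_minutes, title):
    """Walk the four booking steps in order and return a short status string."""
    tools["list_people"]()  # step 1: a real agent would apply the preferences here
    for person in attendees:  # step 2: check every attendee
        result = tools["check_availability"](person, day, time, duration_minutes)
        if result != "available":
            return f"Not booked: {person} has a conflict ({result})"
    tools["book_appointment"](attendees, day, time, duration_minutes, title)  # step 3
    subject = f"Confirmed: {title} on {day} at {time}"  # step 4
    for person in attendees:
        tools["send_email"](person, subject, f"Attendees: {', '.join(attendees)}")
    return "Booked"


status = run_booking_protocol(make_fake_tools({}), ["Ada"], "Monday", "10am", 30, "1:1")
print(status, calls)
```

Writing the sequence out this way is a useful test of your prompt: if you cannot state the order as unambiguously as the code does, the model is unlikely to infer it.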

Testing as you go

Use agent.py for fast interactive iteration:

python agent.py

Try these cases manually before running the full test:

"Book a 1:1 with Ada on Monday at 11am" — does it check first? Does it report the conflict and suggest alternatives?
"Book a 1:1 with Emmy on Tuesday" — does it refuse and explain why?
"Set up a team meeting with everyone on Monday, find a time" — does it read all three calendars?

When you think the prompt is ready, run the full test:

python test_script.py

The test resets state automatically. Iterate on the failures and re-run until your score plateaus. It's fine to run the test 5-10 times as you iterate.

On the `list_people` tool

Turn 4 specifically tests whether the agent calls list_people before acting on a 1:1 request. This is one of the harder points to earn because the model will often read the preference directly from your prompt text and skip the tool call. To force the tool call, your prompt needs to make the mandate explicit and give the model a reason: preferences are subject to change and the tool is the only authoritative source.

This is a good example of a general principle: telling the model what to do is not enough when the model can shortcut it. You have to explain why the tool call is required — that the in-prompt knowledge may be stale.

Activity 3: Measure and Compare (30 min)

Run the test with the bad prompt, then with your improved prompt. Record the per-turn scores and token counts from both runs.

# Run 1: bad prompt (save your work first)
cp system_prompt.md my_prompt.md
cp bad_prompt.md system_prompt.md
python test_script.py

# Run 2: restore your prompt
cp my_prompt.md system_prompt.md
python test_script.py

The test prints per-turn token counts and totals. Fill in both tables:

Scoring table:

| Turn | Bad Prompt | Your Prompt |
| --- | --- | --- |
| 1: Conflict detection (Ada 11am) | /2 | /2 |
| 2: Check + book + Confirmed: email | /1 | /1 |
| 3: Group scheduling (3-person) | /2 | /2 |
| 4: Emmy 1:1 preference | /2 | /2 |
| 5: Conflict detection (Alan team meeting) | /1 | /1 |
| 6: Cancel + Cancelled: email | /1 | /1 |
| 7: Find slot + book Ada & Alan | /1 | /1 |
| 8: Agenda email to all 3 | /1 | /1 |
| Total score | /11 | /11 |

Token table (from the scorecard at the end of each run):

| | Bad Prompt | Your Prompt |
| --- | --- | --- |
| Total input tokens | | |
| Total output tokens | | |
| Total tokens | | |

Now open good_prompt.md and read it carefully. It is intentionally thorough. Note what it does that yours does not.

Token analysis

Once you have both runs' data, go deeper. The test script prints per-turn token counts — record them individually for both runs.

Think about what drives the token counts:

System prompt length. Every API call includes the full system prompt. A longer prompt adds tokens to every single turn of the conversation. The system prompt is not free.
Tool call loops. Each time the agent calls a tool, the API is called again with the same (growing) conversation history plus the tool result. A turn that requires 3 sequential tool calls therefore makes 4 API calls: one for each tool request and a final one that produces the reply. Each call pays input tokens for the entire history so far.
Iteration and recovery. When the agent makes a wrong decision and the user corrects it, that correction becomes part of the conversation history. Every subsequent turn carries the cost of that detour. A prompt that prevents wrong decisions saves tokens over time — not on the turn where the mistake happens, but on every turn that follows.
The list_people overhead. The good prompt calls list_people on every booking turn. That is an extra tool call and an extra API round-trip. In this 8-turn test, that adds meaningful token cost. Is the added cost justified? Consider what Turn 4 tests: a model that skips list_people and reads Emmy's preference directly from the prompt will fail the turn if the tool call is required for credit. In a production system, a model that relies on stale prompt text instead of calling the tool would eventually schedule a meeting that violates a changed preference. The tool call is insurance. You are deciding how much insurance is worth.
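
A rough model helps when reasoning about these drivers. The sketch below assumes that every API call in a turn resends the system prompt and the history, and that each call adds a fixed number of tokens (tool request plus result) to the history. All numbers are made up for illustration and are not measurements from the lab.

```python
def turn_input_tokens(system_tokens, history_tokens, api_calls, added_per_call):
    """Total input tokens for one turn under the simple model described above."""
    total = 0
    for k in range(api_calls):
        # Call k sees the system prompt, the prior history, and k earlier tool exchanges.
        total += system_tokens + history_tokens + k * added_per_call
    return total


# Illustrative values only: a short prompt versus a long one, same turn otherwise.
short_prompt = turn_input_tokens(400, 500, api_calls=4, added_per_call=150)
long_prompt = turn_input_tokens(1200, 500, api_calls=4, added_per_call=150)

print(short_prompt)                # 4500
print(long_prompt)                 # 7700
print(long_prompt - short_prompt)  # 3200
```

Under this model the extra 800 prompt tokens are paid once per API call, so the gap on this turn is 4 x 800 = 3200 input tokens. Compare that shape with your own per-turn numbers.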

With this in mind, answer: is there a version of your prompt that achieves a comparable score at lower token cost? Try stripping out sections that don't affect the score and re-running. Where does the score drop first?

Activity 4: Reflection (20 min)

Write a short report covering the following. Be specific — reference your data, not general impressions.

Required sections

1. Failure analysis. For the bad prompt run, describe each failing turn. What did the agent do? What was missing from the prompt that caused the failure? Quote the failure verdict line from the test output.

2. Prompt changes and their impact. For your best-scoring prompt, identify the two or three changes that had the largest effect on accuracy. Why did those specific changes work? Reference the prompting principles from the lectures.

3. Token usage analysis

Include both tables from Activity 3 (per-turn scores and token counts for the bad-prompt run and for your prompt's run).

Then answer:

Which turns consumed the most tokens in each run, and why?
Did the more accurate prompt use more or fewer total tokens than the bad prompt? Was this what you expected?
The system prompt is included in every API call. Given that, what's the cost of adding 500 more words to a prompt that runs thousands of conversations per day?
Is there a version of your prompt that scores 9/11 or better but uses significantly fewer tokens than good_prompt.md? Describe what you kept and what you cut.
What is your assessment of the sweet spot between prompt quality and token efficiency for this agent?
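
For the 500-word question, a back-of-the-envelope calculation is enough. Every input in this sketch is an assumption you should replace with your own figures: roughly 1.3 tokens per English word, 20 API calls per conversation, 2,000 conversations per day, and a placeholder price per million input tokens.

```python
def extra_input_tokens_per_day(extra_words, tokens_per_word,
                               calls_per_conversation, conversations_per_day):
    """Extra input tokens per day caused by lengthening the system prompt."""
    extra_tokens_per_call = round(extra_words * tokens_per_word)
    return extra_tokens_per_call * calls_per_conversation * conversations_per_day


def daily_cost(tokens, price_per_million_tokens):
    """Convert a token count to cost at a given price per million tokens."""
    return tokens / 1_000_000 * price_per_million_tokens


tokens = extra_input_tokens_per_day(
    extra_words=500,
    tokens_per_word=1.3,        # rough rule of thumb, not a measured value
    calls_per_conversation=20,  # assumed; count the API calls in your own runs
    conversations_per_day=2000, # assumed
)
print(tokens)                   # 26000000
print(daily_cost(tokens, 1.0))  # 26.0, using a placeholder price of 1.0 per million
```

The point is the multiplication, not the specific total: prompt length is paid on every API call, so it scales with both tool use and traffic.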

4. Comparison with good_prompt.md. What does good_prompt.md do that your prompt does not? Is the gap a missing rule, a missing tool instruction, or a missing output format requirement? Which principle from Lectures 5.1–5.2 explains why the reference prompt includes it?

Deliverables

Submit:

Your final system_prompt.md
Your lab report (as lab-03-report.md) containing all four sections above

Tips

Start from failures, not from theory. Each test failure tells you exactly what instruction is missing. The verdict messages are specific — use them as a checklist. Fix one thing, re-run, move on.

Positive instructions outperform prohibitions. "Call check_availability for every attendee before booking" is more reliable than "never book without checking." The first tells the model what to do; the second just tells it what not to do.

Tool descriptions are part of your prompt. The model sees the tool schema descriptions when it decides whether to call a tool. Put usage rules in the system prompt as well — the model needs to know the sequence, not just the definition.

Mandatory tool calls require justification. If you want the model to call a tool even when it thinks it already knows the answer, you need to explain why the tool call is required. "Preferences may change — the tool is the authoritative source" gives the model a reason to call list_people instead of reading from the prompt.

The email format is invisible until it isn't. Turns 2 and 6 test specific subject line content. The model will invent a reasonable-sounding subject that happens to fail the check. The only fix is to specify the format explicitly. This is a good example of why output format specifications belong in the system prompt, not left to inference.

Distinguish memory from verification. A well-prompted agent calls check_availability even when it remembers a conflict from earlier in the conversation. This is a testable requirement (Turn 5). The model needs an explicit instruction — and a reason — to call the tool rather than rely on its context.

Don't over-engineer. The token analysis in Activity 3 is meant to push back on the assumption that more prompt is always better. Some instructions in good_prompt.md only affect one or two turns. Are they worth their token cost at scale? That's a real engineering tradeoff, not a gotcha — and the lab is asking you to think through it.