Lab 2: API Exploration and Context Experiments

Section 2 Lab | Agent Engineering
Duration: ~2-3 hours
Prerequisites: Lab 1 + Modules 3-4 (Lectures 3.1 through 4.3)


Overview

In Lab 1, you set up your environment and built a chat loop. In this lab, you'll use those skills to run experiments — systematically testing how LLMs behave under different conditions and observing context dynamics firsthand.

This is an observation lab. You'll write code, but the deliverable is a lab report documenting what you found. The goal is to build calibrated intuition about what LLMs can do, where they break, and how context affects quality.

Model choice: You may use any frontier model from Anthropic, OpenAI, or Google for this lab. We recommend Claude Haiku or Sonnet for cost efficiency — you'll be making many API calls. Whichever model(s) you use, document them clearly in your report. If you compare models, that's worth noting too.


What You'll Produce

A lab report (Markdown file) documenting:

  1. Your generation parameter experiments and observations
  2. Capability boundaries and failure modes you discovered
  3. Context growth data and recall test results
  4. Your prompts, your methodology, and your specific findings

Important: Your report must include the actual prompts you used and the specific results you observed. Generic summaries like "quality decreased" aren't sufficient — show the data.


Part 1: Generation Parameters (~30 minutes)

1.1 Temperature Effects

Write a script that sends the same prompt to the API multiple times at different temperature values: 0.0, 0.3, 0.7, and 1.0. Run each temperature 3 times (12 calls total).

Use a prompt that's sensitive to variation — something creative where there's no single correct answer. For example, you might ask the model to name a startup or describe an imaginary animal. Do not use these examples — come up with your own.
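The sweep can be scripted as below. This is a minimal sketch assuming the Anthropic SDK used elsewhere in this lab; the model ID and the prompt string are placeholders you must replace with your own choices.

```python
TEMPERATURES = [0.0, 0.3, 0.7, 1.0]
RUNS_PER_TEMP = 3

def build_trials(temperatures, runs_per_temp):
    """All (temperature, run_number) pairs: 4 temperatures x 3 runs = 12 calls."""
    return [(t, n) for t in temperatures for n in range(1, runs_per_temp + 1)]

def run_sweep(client, model, prompt):
    """Send the same prompt once per trial and collect each output."""
    results = []
    for temp, run in build_trials(TEMPERATURES, RUNS_PER_TEMP):
        response = client.messages.create(
            model=model,
            max_tokens=200,
            temperature=temp,
            messages=[{"role": "user", "content": prompt}],
        )
        text = response.content[0].text
        results.append({"temperature": temp, "run": run, "text": text})
        print(f"T={temp} run {run}: {text[:80]!r}")
    return results

if __name__ == "__main__":
    # Third-party imports kept here so the helpers above stay importable.
    from dotenv import load_dotenv
    from anthropic import Anthropic

    load_dotenv()
    # Placeholders: substitute your own model ID and your own creative prompt.
    run_sweep(Anthropic(), "claude-3-5-haiku-latest", "YOUR CREATIVE PROMPT")
```

Saving the `results` list (e.g. as JSON) makes it easy to build the comparison table for your report.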

Record in your report:

  1. How much the outputs varied across the 3 runs at each temperature
  2. Whether temperature 0.0 produced identical (or nearly identical) outputs across runs
  3. Sample outputs from the extremes (0.0 and 1.0) showing the difference

1.2 Temperature and Code

Repeat the experiment with a coding prompt — ask the model to write a specific function or solve a programming problem.

Record in your report:

  1. Whether the generated code varied across runs at each temperature
  2. Whether higher temperatures produced buggy or lower-quality code
  3. Which temperature you would choose for code generation, and why

1.3 Top-p Sampling

Hold temperature at 1.0 and vary top-p: 0.1, 0.5, 0.9. Run each 3 times with the same creative prompt from 1.1.

Record in your report:

  1. How the outputs changed across top-p values 0.1, 0.5, and 0.9
  2. How the effect of varying top-p compares to varying temperature in 1.1

1.4 Stop Sequences

Send a prompt that asks the model to produce a numbered list (at least 10 items). Add a stop sequence that cuts it off mid-list — for example, stopping at item 5.
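One way to set the cutoff, assuming the Anthropic SDK as in the padding script later in this lab. The helper computes a stop sequence that truncates a numbered list before a given item; the usage sketch below it is commented out because `client`, `MODEL`, and the prompt are placeholders.

```python
def list_stop_sequence(item_number):
    """Stop sequence that ends generation the moment the model starts the
    given list item. Item 6 yields '\n6.', so the reply stops after item 5."""
    return f"\n{item_number}."

# Usage sketch (client and MODEL as in context_padder.py; prompt is yours):
# response = client.messages.create(
#     model=MODEL,
#     max_tokens=1024,
#     stop_sequences=[list_stop_sequence(6)],
#     messages=[{"role": "user", "content": "Give me a numbered list of 10 ..."}],
# )
# print(response.stop_reason)  # "stop_sequence" when the cutoff triggered
```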

Record in your report:

  1. The stop sequence you used and where the output actually stopped
  2. The stop reason reported by the API, if your SDK exposes one

Cost tip: Use Haiku for this section — you're making many calls for comparison, and model quality isn't the point. Switch to Sonnet or a frontier model for Parts 2 and 3 where quality matters.


Part 2: Capability Testing and Failure Modes (~30 minutes)

2.1 Capability Probes

Test the model on tasks that stress different abilities. For each category below, design your own test prompts and document the results.

Reasoning: Give the model a multi-step logic puzzle or deduction problem. Example direction: "something that requires 3+ steps of inference." Try making it progressively harder. Where does it break?

Arithmetic: Test with small numbers, then increase. At what scale does the model start making errors? Try both multiplication and division.

Code generation: Ask for a well-known algorithm, then something more obscure or domain-specific. Compare the quality.

Knowledge boundaries: Ask about events or facts from different time periods. Can you find the model's knowledge cutoff? Ask about something obscure but verifiable — does the model admit uncertainty or fabricate?

2.2 Inducing Failures

Deliberately try to make the model fail. These failure modes trace directly back to training (Lecture 2.2 — RLHF side effects, internet training data patterns).

Hallucination: Ask about something that sounds plausible but doesn't exist — a fake research paper, a made-up Python library, a fictional historical event. Does the model generate confident-sounding nonsense, or does it flag uncertainty? Example direction: "ask about a specific paper by a real author on a topic they didn't write about." Do not use this example — design your own.

Sycophancy: State something factually wrong with confidence and see if the model agrees or pushes back. Try varying how assertively you state the wrong thing.

Conflicting instructions: Put one instruction in the system prompt and a contradicting instruction in the user message. Which one wins? Try several variations.

Record in your report:

  1. The exact prompts you used for each failure mode, with quoted responses
  2. Which failures were easy to induce and which the model resisted
  3. For conflicting instructions: which instruction won in each variation


Part 3: Context Experiments (~45-60 minutes)

This is the core of the lab. You'll observe the concepts from Module 4 firsthand.

3.1 Context Growth Observation

Start from your Lab 1 chat loop (or write a new one). Add cumulative token tracking — after each turn, print the turn number, the input and output token counts for that call, and the running total of input tokens.
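The per-turn logging can be sketched as below, assuming the `response.usage` object shape (`input_tokens`, `output_tokens`) that the padding script in 3.2 also relies on.

```python
def log_turn(token_log, usage):
    """Record one turn's token usage and print the running totals.
    `usage` is the response.usage object from the SDK."""
    entry = {
        "turn": len(token_log) + 1,
        "input_tokens": usage.input_tokens,
        "output_tokens": usage.output_tokens,
    }
    token_log.append(entry)
    total_in = sum(e["input_tokens"] for e in token_log)
    print(f"turn {entry['turn']}: input={usage.input_tokens}, "
          f"output={usage.output_tokens}, cumulative input={total_in}")
    return entry
```

Call `log_turn(token_log, response.usage)` right after each `client.messages.create(...)` in your loop.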

Have a 20+ turn conversation on a consistent topic. Record the input token count at turns 1, 5, 10, 15, and 20.

Record in your report:

  1. The input token counts at turns 1, 5, 10, 15, and 20
  2. Whether growth was roughly linear, and your estimate of tokens added per turn

3.2 Recall Testing

This experiment tests whether the model can retrieve specific facts as context grows.

Setup: In your first message, tell the model 5 specific facts. Make them distinct and unambiguous — a name, a number, a color, a city, and a date. For example: "My cat's name is Marble, I own 7 guitars, my favorite color is teal, I grew up in Lisbon, and I graduated on June 3rd, 2019." Use your own facts, not these.

Procedure:

  1. After stating the facts, continue the conversation on unrelated topics for 10 turns
  2. Quiz the model on all 5 facts. Record which it gets right.
  3. Continue for another 20 unrelated turns (30 total)
  4. Quiz again. Record results.
  5. Continue for another 20 turns (50 total)
  6. Quiz one more time. Record results.

To save time, use the context padding script below to automate the unrelated turns.

context_padder.py:

"""
Pads a conversation with N unrelated exchanges to test context recall.
Sends generic questions on random topics to grow the conversation
without introducing information related to your test facts.
"""

import random
from dotenv import load_dotenv
from anthropic import Anthropic

load_dotenv()
client = Anthropic()

PADDING_PROMPTS = [
    "What's the difference between a stack and a queue?",
    "Explain how a combustion engine works in simple terms.",
    "What are three interesting facts about octopuses?",
    "How does a binary search algorithm work?",
    "What causes thunder?",
    "Describe how a refrigerator keeps food cold.",
    "What's the tallest building in the world and how tall is it?",
    "Explain the water cycle in a few sentences.",
    "How does email actually get delivered?",
    "What's the difference between HTTP and HTTPS?",
    "How do noise-canceling headphones work?",
    "What causes the tides?",
    "How does a microwave heat food?",
    "What's the difference between RAM and a hard drive?",
    "How do airplanes stay in the air?",
    "What causes a rainbow?",
    "How does GPS know where you are?",
    "What's the difference between a virus and a bacterium?",
    "How does a touchscreen detect your finger?",
    "What causes earthquakes?",
]


def pad_conversation(messages, model, system_prompt, num_turns=10):
    """
    Add num_turns of unrelated exchanges to the conversation.
    Returns the updated messages list and token counts per turn.
    """
    prompts = random.sample(PADDING_PROMPTS, min(num_turns, len(PADDING_PROMPTS)))
    if num_turns > len(PADDING_PROMPTS):
        # Repeat if we need more turns than available prompts
        prompts = prompts * (num_turns // len(prompts) + 1)
        prompts = prompts[:num_turns]

    token_log = []

    for i, prompt in enumerate(prompts, 1):
        messages.append({"role": "user", "content": prompt})
        response = client.messages.create(
            model=model,
            max_tokens=256,
            system=system_prompt,
            messages=messages,
        )
        assistant_text = response.content[0].text
        messages.append({"role": "assistant", "content": assistant_text})

        token_log.append({
            "turn": i,
            "input_tokens": response.usage.input_tokens,
            "output_tokens": response.usage.output_tokens,
        })
        print(f"  Padding turn {i}/{num_turns} — "
              f"input: {response.usage.input_tokens}, "
              f"output: {response.usage.output_tokens}")

    return messages, token_log

Use this by importing pad_conversation into your experiment script. Pass in your existing messages list, and it will add unrelated exchanges while tracking tokens.
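For the quiz steps, a crude scoring helper like the one below can tally which seeded facts surface in the model's reply. It uses case-insensitive substring matching only, so a paraphrased recall counts as a miss — verify misses by hand. The fact strings in the usage sketch are placeholders.

```python
def score_quiz(reply_text, facts):
    """Return (number recalled, list of facts not found) by case-insensitive
    substring search of the quiz reply. Crude: paraphrased or reworded
    recalls count as misses, so double-check any miss manually."""
    lowered = reply_text.lower()
    missed = [f for f in facts if f.lower() not in lowered]
    return len(facts) - len(missed), missed

# Usage sketch (messages/MODEL/SYSTEM_PROMPT are your own variables):
# messages, log = pad_conversation(messages, MODEL, SYSTEM_PROMPT, num_turns=10)
# messages.append({"role": "user",
#                  "content": "What are the 5 facts I told you at the start?"})
# ...send the request, then:
# recalled, missed = score_quiz(reply_text, MY_FACTS)
```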

Record in your report:

  1. Quiz results at the 10-, 30-, and 50-turn checkpoints: which facts the model recalled correctly
  2. The input token count at each quiz checkpoint

3.3 Compaction Experiment

Now test whether compaction (from Lecture 4.2) preserves recall while reducing token count.

Procedure:

  1. Build up a conversation to 40+ turns using the padding script (seed it with your 5 facts first)
  2. Record the input token count and quiz the model on your 5 facts
  3. Implement compaction: send the full conversation to the model with a prompt instructing it to summarize — preserving key facts, decisions, and current goals while discarding raw exchanges
  4. Replace the conversation history with the summary (as a single user message + an assistant acknowledgment)
  5. Record the new input token count
  6. Quiz the model on the same 5 facts from the compacted state
  7. Continue the conversation for 10 more turns, then quiz again
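Steps 3-4 can be sketched as follows, assuming the Anthropic SDK. The summarization prompt shown is one possible wording, not required text; tune it if facts get dropped.

```python
COMPACTION_PROMPT = (
    "Summarize our conversation so far. Preserve all specific facts, names, "
    "numbers, and dates mentioned by the user, plus any decisions and current "
    "goals. Discard the raw back-and-forth exchanges."
)

def build_compacted_history(summary):
    """Step 4: replace the full history with a summary message pair."""
    return [
        {"role": "user",
         "content": "Summary of our conversation so far:\n" + summary},
        {"role": "assistant",
         "content": "Understood. I'll treat that summary as our shared context."},
    ]

def compact(client, model, system_prompt, messages):
    """Step 3: ask the model to summarize the history, then return the
    compacted replacement history."""
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        system=system_prompt,
        messages=messages + [{"role": "user", "content": COMPACTION_PROMPT}],
    )
    return build_compacted_history(response.content[0].text)
```

After calling `messages = compact(...)`, the next API response's `usage.input_tokens` gives you the post-compaction count for step 5.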

Record in your report:

  1. Input token counts before and after compaction
  2. Quiz results before compaction, immediately after, and after 10 more turns
  3. Any facts the summary dropped, and how you revised the summarization prompt in response


Deliverable

A lab report in Markdown format (lab-02-report.md) containing:

  1. Generation Parameters — temperature/top-p observations with a table of results, conclusion about when to use low vs. high temperature for agents
  2. Capability Boundaries — what the model can and can't do, with your specific test prompts and results
  3. Failure Modes — hallucination, sycophancy, and conflict examples with quoted responses
  4. Context Experiments — token growth data, recall test results at each checkpoint, compaction before/after analysis
  5. Key Takeaways — 3-5 lessons you'd apply when building an agent, grounded in what you observed

Include all prompts you used, all models you tested, and specific observations — not just conclusions.

Submit your lab-02-report.md along with any scripts you wrote.


Troubleshooting

API calls are slow with long conversations: This is expected — longer context means more processing time. If Part 3 feels slow, switch to Haiku for the padding turns. The model choice matters most for the quiz turns where you're testing recall quality.

Token counts seem wrong or missing: Make sure you're reading response.usage.input_tokens and response.usage.output_tokens. Some SDK versions may structure this differently — check the API reference for your provider if you're not using Anthropic.

Compaction summary is too short / loses important facts: Your summarization prompt matters. Be specific about what to preserve. If the summary drops your test facts, revise the prompt to explicitly instruct preservation of "all specific facts, names, numbers, and dates mentioned by the user."

Rate limit errors during padding: You're making many calls in a loop. Add a small delay (import time; time.sleep(0.5)) between calls if you hit rate limits. Haiku has higher rate limits than Sonnet/Opus.

Using a non-Anthropic model: The experiments work with any frontier model. The SDK calls will differ — consult your provider's documentation. The concepts (token counting, context growth, recall degradation) are universal.