class: center, middle, inverse
count: false

# Controlling Generation

---

# What Parameters Control Generation?

In Lecture 3.1, we covered the structure of an API call — what goes in, what comes out. But we glossed over the parameters that control **how** the model generates its response.

--

These aren't obscure settings. They directly affect whether your agent is:

- **reliable** or erratic
- **concise** or verbose
- **creative** or deterministic

???

[1-2 min]

Frame this as practical. Students will see how to apply these settings immediately.

---

class: center, middle, inverse

# Temperature

---

# Temperature Controls Randomness

From Lecture 2.1: the model produces a **probability distribution** over possible next tokens. Temperature controls how that distribution is used.

--

**Temperature = 0.0** (deterministic)

The model always picks the most probable token. Same input → same output every time.

--

**Temperature = 1.0** (creative)

The model samples across the full distribution. Less likely tokens have a real chance. Different every time.

???

[2 min]

Connect back to the probability distribution concept. Temperature is the control knob on that distribution.

---

# What This Looks Like in Practice

**Prompt:** "Name a color."

--

**Temperature 0:** "Blue." Every time.

--

**Temperature 0.3:** "Blue." Usually. Occasionally "Red" or "Green."

--

**Temperature 1.0:** "Blue." Sometimes. But also "Cerulean," "Mauve," "Burnt sienna." Different every time.

--

The underlying knowledge doesn't change. What changes is how much the model **explores beyond the most obvious answer.**

???

[2 min]

The color example makes temperature tangible. Students should be able to picture what each setting does. But don't just talk about it — let's see it.

---

class: center, middle, inverse

# Live Coding
## `temperature_demo.py`

???

[3-4 min]

Switch to terminal. Run temperature_demo.py — students will see the variation (or lack thereof) in real time.
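---

# Under the Hood: Scaling the Distribution

Temperature divides the model's raw scores (logits) before the softmax turns them into probabilities. A toy sketch of that mechanism; the logits below are made up for illustration, not taken from any real model:

.small[
```python
import math

def apply_temperature(logits, temperature):
    """Softmax over logits scaled by temperature.
    Lower temperature sharpens the distribution toward the top token."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical next-token logits for "Name a color."
tokens = ["Blue", "Red", "Green", "Mauve"]
logits = [4.0, 3.0, 2.5, 1.0]
for t in (0.3, 1.0):
    probs = apply_temperature(logits, t)
    print(t, {tok: round(p, 3) for tok, p in zip(tokens, probs)})
```
]

At temperature 0.3, "Blue" takes nearly all of the probability mass; at 1.0, the other colors keep a real share. This is the same behavior the demo shows from the outside.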
---

# Live Demo: Temperature in Action

.small[
```python
PROMPT = "Name a color."
TEMPERATURES = [0.0, 0.3, 1.0]
RUNS_PER_TEMP = 5

for temp in TEMPERATURES:
    print(f"--- Temperature {temp} ---")
    responses = []
    for i in range(RUNS_PER_TEMP):
*       response = client.messages.create(
*           model="claude-haiku-4-5-20251001",
*           max_tokens=50,
            temperature=temp,
            messages=[{"role": "user", "content": PROMPT}]
        )
        text = response.content[0].text.strip()
        responses.append(text)
        print(f"  Run {i+1}: {text}")
    unique = len(set(responses))
    print(f"  → {unique} unique out of {RUNS_PER_TEMP}")
```
]

.callout[**Watch the output.** Temperature 0 = identical every time. Temperature 1.0 = different every time. This is why agents use low temperature.]

???

Run temperature_demo.py. Let the output speak for itself. Count unique responses together.

Point out we're using Haiku here — faster and cheaper for experiments. The temperature parameter works the same across all models.

Note: strictly speaking, temperature 0 makes selection greedy and near-deterministic; serving infrastructure can still introduce rare variation, but students shouldn't see it in this demo.

---

# Temperature for Agents

For most agent tasks, you want **low temperature** — typically 0 to 0.3.

--

**Why?** Because agents need to be *reliable*. When your agent reads a file and decides which tool to call next, you want it to make the same decision every time given the same context.

--

> Randomness in agent decision-making is a bug, not a feature.

--

Higher temperature is useful for:

- Brainstorming or generating creative options
- Producing varied examples
- Tasks where diversity matters more than consistency

.callout[**Default for agents: temperature 0 to 0.3.** You want reliability, not creativity, in the decision-making loop.]

???

[2 min]

The blockquote is the key insight. Students should internalize: low temperature for agent reasoning, higher only for intentional variation.

---

class: center, middle, inverse

# Sampling
## Top-k and Top-p

---

# Narrowing the Options

Temperature controls how **random** the selection is.

Top-k and top-p control **which tokens are even considered.**

???
One-line framing to distinguish sampling from temperature.

---

# Top-k Sampling

Top-k says: only consider the **k most probable** tokens, ignore everything else.

--

- **Top-k = 1** — Only the single most likely token. Essentially deterministic.
- **Top-k = 10** — Top 10 tokens. Some variety, but constrained.
- **Top-k = 50** — More options, more variety.

--

Think of it as **reducing the menu before ordering.** Instead of choosing from 50,000 possible next tokens, you're choosing from the top 10.

???

[2 min]

The "menu" metaphor works well. Keep it intuitive — students don't need the math.

---

# Top-p (Nucleus) Sampling

Instead of a fixed number, top-p says: consider the smallest set of tokens whose **combined probability exceeds p.**

--

- **Top-p = 0.1** — Only tokens making up 10% of the probability mass. Usually 1-3 tokens.
- **Top-p = 0.9** — Tokens making up 90%. Most tokens that matter are included.

--

**The advantage over top-k:** it adapts.

- Model is confident (one token at 95%)? Top-p narrows to just that token.
- Model is uncertain? Top-p includes more options.

.info[Top-p is an adaptive menu. Top-k is a fixed menu. Top-p is generally the smarter choice.]

???

[2 min]

Top-p is harder to grasp. The "adaptive vs. fixed menu" contrast helps. Don't spend too long on the math.

---

# What to Use for Agents

In practice, the settings are straightforward:

--

| Parameter | Agent Recommendation |
|---|---|
| **Temperature** | 0 to 0.3 |
| **Top-p** | 0.9 or default |
| **Top-k** | Leave at default |

--

**Don't overthink sampling parameters.** The big lever is temperature. Top-k and top-p are refinements that matter more for creative applications than for agents.

.info[The important thing is understanding **what** these parameters do, so when your agent behaves erratically, you can check whether generation settings are the cause.]

???

[1 min]

Practical recommendation students can follow. The big insight: temperature is the 90% lever.
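---

# Top-p in Code

The adaptive behavior is easy to see in a few lines. A sketch with made-up probabilities, not a real model's output:

.small[
```python
def nucleus(probs, p):
    """Smallest set of tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for token, prob in ranked:
        kept.append(token)
        total += prob
        if total >= p:
            break
    return kept

# Confident model: one token dominates, so the menu shrinks to one item
print(nucleus({"Blue": 0.95, "Red": 0.03, "Green": 0.02}, 0.9))
# → ['Blue']

# Uncertain model: mass is spread, so the menu stays wide
print(nucleus({"Blue": 0.30, "Red": 0.25, "Green": 0.25, "Mauve": 0.20}, 0.9))
# → ['Blue', 'Red', 'Green', 'Mauve']
```
]

Same `p`, different menu sizes: that's the adaptivity a fixed top-k can't give you.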
---

class: center, middle, inverse

# Output Control

---

# Max Tokens

`max_tokens` sets the maximum number of tokens the model can generate in a single response.

--

.split-left[
### Set it too low

Response gets cut off mid-sentence. `stop_reason` = `max_tokens`.

For agents, this can mean a **tool call gets truncated** and becomes unparseable.
]

.split-right[
### Set it too high

Reserving output capacity you don't need. Higher potential costs. Less room for input tokens on models with combined limits.
]
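--

Agent code should check for this. A minimal sketch, assuming the Anthropic Python SDK's response shape (`guard_output` is our own hypothetical helper):

.small[
```python
def guard_output(response):
    """Return the response text, or fail loudly if the model hit the token cap."""
    if response.stop_reason == "max_tokens":
        raise RuntimeError("Truncated output: raise max_tokens or shorten the task")
    return response.content[0].text
```
]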
--

.callout[**Reasonable default for agents: 4096 tokens.** Increase for long outputs (code, reports). Decrease to enforce brevity.]

???

[2 min]

The truncated tool call scenario is worth emphasizing — it's a real bug students will encounter.

---

# Stop Sequences

Stop sequences tell the model: if you generate this exact string, **stop immediately.**

--

Useful for agents to enforce output format — for example, stopping after a tool call instead of continuing with commentary.

--

In practice, modern APIs handle tool calling with structured formats that don't require manual stop sequences. But the concept is important:

> You can control **when** the model stops, not just **how much** it generates.

???

[2 min]

Mention the concept but don't over-invest. Modern tool calling APIs handle this. Students should know stop sequences exist for when they need fine-grained control.

---

# Putting It All Together

A well-configured agent API call:

```python
response = client.messages.create(
*   model="claude-sonnet-4-5-20250929",
*   max_tokens=4096,
*   temperature=0,
    system=system_prompt,
    messages=conversation_history
)
```

Low temperature for reliability. Reasonable max_tokens. System prompt and conversation history providing context.

--

**That's it.** These are the settings you'll use for most agents in this course.

.callout[Don't get lost in parameter tuning. **Low temperature, reasonable max_tokens, good context.** The context matters far more than the parameters.]

???

[2 min]

The highlighted lines show what's new in this lecture. The closing callout is critical: context engineering > parameter tuning.

---

# Live Demo: `generation_config.py`

A complete, well-configured agent call that analyzes code:

.small[
```python
AGENT_CONFIG = {
    "model": "claude-sonnet-4-5-20250929",
    "max_tokens": 4096,
    "temperature": 0,
}

SYSTEM_PROMPT = """You are a code analysis assistant.
When asked about code, explain clearly and concisely.
If you identify a bug, state the line, the problem, and the fix."""

response = client.messages.create(
    **AGENT_CONFIG,
    system=SYSTEM_PROMPT,
    messages=[{"role": "user", "content":
        "What happens if I call average([])?\n\n"
        "```python\ndef average(numbers):\n"
        "    total = 0\n    for n in numbers:\n"
        "        total += n\n    return total / len(numbers)\n```"
    }]
)
```
]

???

[2 min]

Run generation_config.py. The model reliably identifies the ZeroDivisionError. Run it twice to show temperature=0 gives the same analysis both times.

Point out the config dict pattern — this is how you'll structure agent configurations.

---

# Key Takeaways

--

**1. Temperature is the big lever**
Low (0-0.3) for agent reliability. Higher only when you intentionally want variation.

--

**2. Sampling parameters are secondary**
Top-p and top-k refine token selection. Default values are fine for most agent work.

--

**3. Context matters more than parameters**
Getting the context right is the 10x improvement. Parameters are the 1.1x improvement.

???

Three clean takeaways. The third point sets up the next lecture on in-context learning and context engineering.

---

# Coming Up Next

**Lecture 3.4: In-Context Learning and the Limits of Prompting**

How in-context learning works and why it's fundamental to effective prompt design.

???

Brief transition to the next lecture.