Measuring and Managing Context

Module 4, Lecture 4.1 | Section 2: Working with LLMs in Practice

Context windows have limits — but quality degrades before you hit them. This lecture makes context growth visible: how to count tokens per API call, where tokens actually accumulate during an agent task (spoiler: tool results dominate), and how to set a budget threshold that lets you act before the window fills. You'll also see empirically, through a live code demo, how a simple five-turn conversation can accumulate thousands of input tokens in minutes.
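As a rough illustration of the bookkeeping described above, the sketch below records per-call token usage and checks it against a budget threshold. It is not the lecture's demo code: the `ContextBudget` class, the `simulate_conversation` helper, the 200,000-token window, the 0.7 threshold, and all per-turn token counts are assumptions chosen for illustration. With a real API call, the two numbers passed to `record` would come from the `input_tokens` and `output_tokens` fields of the response's usage object.

```python
from dataclasses import dataclass, field


@dataclass
class ContextBudget:
    """Tracks per-call token usage against a budget threshold (hypothetical helper)."""
    window_tokens: int        # assumed context window size for the model
    threshold: float = 0.7    # assumed fraction of the window at which to act
    calls: list = field(default_factory=list)

    def record(self, input_tokens: int, output_tokens: int) -> None:
        # In real use these values come from the usage object on each response.
        self.calls.append((input_tokens, output_tokens))

    @property
    def latest_input(self) -> int:
        # Size of the context sent on the most recent call.
        return self.calls[-1][0] if self.calls else 0

    @property
    def cumulative_input(self) -> int:
        # Total input tokens across all calls: the history is re-sent every turn.
        return sum(inp for inp, _ in self.calls)

    def over_budget(self) -> bool:
        return self.latest_input >= self.threshold * self.window_tokens


def simulate_conversation(budget, turns, system_tokens, user_tokens, reply_tokens):
    """Simulates a chat where each call re-sends the full history (made-up sizes)."""
    context = system_tokens
    for _ in range(turns):
        context += user_tokens                      # new user message joins the context
        budget.record(input_tokens=context, output_tokens=reply_tokens)
        context += reply_tokens                     # the reply is re-sent next turn


budget = ContextBudget(window_tokens=200_000)
simulate_conversation(budget, turns=5, system_tokens=200, user_tokens=50, reply_tokens=300)
print(budget.latest_input, budget.cumulative_input, budget.over_budget())
```

Even with these small made-up message sizes, the final call sends 1,650 input tokens and the five calls together send 4,750, because every turn re-sends everything before it. Tool results, which are typically much larger than chat messages, would push these numbers up far faster.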

Read the full lecture narrative

Additional Resources

Lecture slides
Messages — Claude API Reference — Official documentation for the usage object returned on every API response, with input_tokens and output_tokens fields.
Token Counting — Claude API Docs — Anthropic's guide to estimating token counts before sending a request, using the count_tokens API.
Lost in the Middle: How Language Models Use Long Contexts (Liu et al., 2023) — The foundational paper establishing that LLM performance degrades for content positioned in the middle of long contexts, with highest accuracy when relevant information appears at the beginning or end.
Anthropic Pricing — Per-token costs for input and output across all Claude models, making the input/output price differential concrete.
Prompt Caching — Claude API Docs — How to cache repeated prompt prefixes to reduce both cost and latency for stable context.
Context Rot — Chroma Research — A 2025 empirical study of 18 frontier models showing performance degrades measurably as context grows, well before the window limit, due to attention dilution.
RULER: What's the Real Context Size of Your Long-Context Language Models? (Hsieh et al., 2024) — NVIDIA benchmark demonstrating that advertised context length and effective context length often differ significantly; most models underperform at their claimed maximum.