Module 4, Lecture 4.1 | Section 2: Working with LLMs in Practice
Context engineering — curating the smallest possible set of high-signal tokens — is the principle introduced in Lecture 3.4. This lecture is about its prerequisite: measurement. You cannot manage context you cannot see. The goal here is to make context growth visible: how to count tokens as they accumulate, where they come from, and when the quantity starts to hurt quality.
Every major LLM API reports token usage per call. In the Anthropic SDK, each response object includes a usage field:
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain recursion in two sentences."}],
)
print(f"Input: {response.usage.input_tokens} tokens")
print(f"Output: {response.usage.output_tokens} tokens")
These are reported per call. Across a multi-turn session, you accumulate them yourself.
Output tokens cost roughly three to five times more than input tokens, and the reason is architectural. Input tokens are processed in parallel — the entire input is known upfront and can be passed through the model in a single forward pass. Output tokens are generated sequentially: each token requires a full forward pass through the model because each token depends on all preceding tokens (the autoregressive generation mechanism from Lecture 2.1).
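Sequential generation uses the hardware less efficiently: the model runs one forward pass per output token instead of one pass for the whole input, so each output token takes more GPU time. That cost is reflected in the price. The practical implication: keeping output concise is an effective cost-reduction strategy, not just an aesthetic preference.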
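To make the asymmetry concrete, here is a minimal sketch that prices a call from its usage counts. The per-million-token prices are placeholder assumptions chosen to give a 5x output-to-input ratio, not any provider's actual rates, and `estimate_cost` is a hypothetical helper.

```python
# Placeholder prices in dollars per million tokens (assumed, not real rates).
INPUT_PRICE_PER_MTOK = 3.00
OUTPUT_PRICE_PER_MTOK = 15.00


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated dollar cost of one call."""
    input_cost = input_tokens * INPUT_PRICE_PER_MTOK / 1_000_000
    output_cost = output_tokens * OUTPUT_PRICE_PER_MTOK / 1_000_000
    return input_cost + output_cost


# Same input, two different output lengths.
verbose = estimate_cost(input_tokens=1_000, output_tokens=1_000)
concise = estimate_cost(input_tokens=1_000, output_tokens=200)
print(f"verbose reply: ${verbose:.4f}")
print(f"concise reply: ${concise:.4f}")
```

Under these assumed prices, trimming the reply from 1,000 to 200 output tokens cuts the cost of the call to a third, even though the input is unchanged.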
Individual API calls use surprisingly few tokens. The accumulation happens across turns. Consider a simple user request: "Fix the bug in parser.py." That sentence is maybe ten tokens. But here is what actually flows through the context window over a realistic agent session:
| Step | Event | Approximate Tokens |
|---|---|---|
| 1 | System prompt (always present) | ~500 |
| 2 | User message | ~10 |
| 3 | Agent reads parser.py | ~800 |
| 4 | Agent reads the test file | ~600 |
| 5 | Agent proposes edit | ~200 |
| 6 | Tool confirms edit | ~50 |
| 7 | Agent re-reads parser.py to verify | ~800 |
| 8 | Agent runs tests | ~400 |
| 9 | Tests fail; agent reads error output | ~300 |
| 10 | Agent makes another edit, re-reads, re-runs | ~1,500 |
Running total: ~5,160 tokens from a single user request.

Notice step 7: the agent reads the same file twice, and both copies stay in context. The model is stateless between calls, so the full history, including every tool result, is re-sent on every subsequent call. That is the fundamental reason context grows so fast.
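The running total in the table can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the token counts are the approximate figures from the table, and the variable names are illustrative.

```python
# Token counts per event, copied from the table above.
events = [
    ("System prompt", 500),
    ("User message", 10),
    ("Agent reads parser.py", 800),
    ("Agent reads the test file", 600),
    ("Agent proposes edit", 200),
    ("Tool confirms edit", 50),
    ("Agent re-reads parser.py", 800),
    ("Agent runs tests", 400),
    ("Tests fail; agent reads error output", 300),
    ("Another edit, re-read, re-run", 1500),
]

context_size = 0
for step, (event, tokens) in enumerate(events, start=1):
    context_size += tokens  # nothing is ever removed from the context
    print(f"{step:>2}  {event:<38} +{tokens:>5}  context={context_size:>5}")

duplicated = 800  # the second copy of parser.py from step 7
print(f"Final context: {context_size} tokens, "
      f"{duplicated} of them a repeated file read")
```

The context only ever grows in this loop, which mirrors what happens when every tool result is appended and nothing is pruned.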
In a typical agent session, the token breakdown is roughly:

Tool results dominate. Every file read, command output, and search result is appended to the context and re-sent on every future call. This is why token-efficient tool design (covered in Lecture 4.3) has such a large impact — it is the biggest lever available.
The script context_growth.py simulates a five-turn coding assistant session and tracks token accumulation per step:
# After each API call, track cumulative tokens
cumulative_input += response.usage.input_tokens
cumulative_output += response.usage.output_tokens
print(f"Step {i:<3} in={response.usage.input_tokens:>6} "
      f"out={response.usage.output_tokens:>6} "
      f"cum_in={cumulative_input:>8}")
Running this produces output like:
Step 1   in=   137 out=   313 cum_in=     137
Step 2   in=   461 out=   224 cum_in=     598
Step 3   in=   703 out=   187 cum_in=    1301
Step 4   in=   912 out=   209 cum_in=    2213
Step 5   in=  1142 out=   271 cum_in=    3355
The input token count on each call covers the full conversation so far, because the whole history is re-sent. Output from previous steps becomes input for all subsequent steps. After five turns of a simple coding conversation, cumulative input tokens reach 3,355, and that is without any file reading.
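The sample output can be decomposed to show how little of each call is actually new. The sketch below assumes that each call's input is simply the previous call's input plus the previous call's output plus the new user message, ignoring any small formatting overhead. The numbers are copied from the sample output, and `new_per_step` is an illustrative name.

```python
# (input_tokens, output_tokens) per step, copied from the sample output above.
steps = [(137, 313), (461, 224), (703, 187), (912, 209), (1142, 271)]

cumulative_input = 0
new_per_step = []
previous = None
for i, (tokens_in, tokens_out) in enumerate(steps, start=1):
    cumulative_input += tokens_in
    if previous is None:
        new_tokens = tokens_in  # first call: everything is new
    else:
        prev_in, prev_out = previous
        # Assumed model: input = last input + last output + new message.
        new_tokens = tokens_in - prev_in - prev_out
    new_per_step.append(new_tokens)
    print(f"Step {i}: in={tokens_in:>5} new={new_tokens:>4} "
          f"cum_in={cumulative_input:>5}")
    previous = (tokens_in, tokens_out)
```

Under that assumption, each turn after the first adds only 11 to 22 genuinely new input tokens. Everything else in the input count is history being re-sent.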
The complete script is in context_growth.py.
Context windows are advertised as ceilings — 128K, 200K, even 1M tokens for recent models. These are hard limits. Once your context exceeds the model's maximum, the API rejects the request. But the practical limit is lower.
Quality degrades well before the hard limit is reached. More context means the model's attention must cover more tokens, and the coverage becomes thinner. The degradation is not gradual — it tends to hold steady for a while and then drop sharply once the context becomes dense enough that the model struggles to maintain coherence.

The "Lost in the Middle" paper (Liu et al., 2023) established a specific pattern: models attend more to the beginning and end of long contexts, with weaker recall for information in the middle. Content in the middle of a long conversation is less likely to influence output, regardless of its relevance. In practice, you may notice an agent that starts a session with high precision gradually becoming less precise as the conversation grows — ignoring earlier instructions, losing track of constraints, giving less targeted answers.
The threshold where quality noticeably degrades is typically around 60–70% of the model's maximum context — not 90% or 95%. The context window is a ceiling, not a target.
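A context budget is a threshold: when the current context size reaches X% of the model's limit, take action. The common default is 80%. Because quality can start to degrade around 60–70%, a lower threshold is a reasonable choice when output quality matters more than session length.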
MAX_CONTEXT = 200_000  # model's limit
BUDGET = int(MAX_CONTEXT * 0.80)  # 160,000 tokens

if response.usage.input_tokens > BUDGET:
    # Time to manage context (strategies in Lecture 4.2)
    print(f"Context budget exceeded: {response.usage.input_tokens}/{BUDGET}")
The 80% threshold is chosen deliberately: it leaves enough headroom for the model to operate on the current context — to summarize it, compact it, or take some other action. If you wait until the hard limit, there is nothing you can do. If you act at 80%, you still have 20% available for remediation.
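The check above can be wrapped in a small function that also reports how much headroom remains. This is a sketch, not a prescribed interface: `context_status` and `headroom` are hypothetical names, and the 200,000-token limit and 80% fraction are the example values from this lecture.

```python
def context_status(input_tokens: int, max_context: int = 200_000,
                   budget_fraction: float = 0.80) -> str:
    """Classify the current context size against a budget."""
    budget = int(max_context * budget_fraction)
    if input_tokens > max_context:
        return "over_limit"   # the API would reject this request
    if input_tokens > budget:
        return "over_budget"  # act now, while headroom remains
    return "ok"


def headroom(input_tokens: int, max_context: int = 200_000) -> int:
    """Tokens still available before the hard limit (never negative)."""
    return max(0, max_context - input_tokens)


for size in (150_000, 170_000, 210_000):
    print(f"{size:>7} tokens: {context_status(size):<11} "
          f"headroom={headroom(size)}")
```

Returning a status string rather than printing keeps the decision about what to do (summarize, compact, stop) in the caller.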
The specific strategies — sliding windows, selective preservation, compaction — are covered in Lecture 4.2. What matters here is establishing the habit of checking.
Three reasons to keep context small rather than expanding to the largest available window:
The goal is the smallest context that contains everything the model needs — not the largest context the model will accept.
Token tracking should be built into an agent from the beginning, not added after performance problems appear. The minimum viable instrumentation is a print statement after every API call:
print(f"[tokens] in={response.usage.input_tokens} "
      f"out={response.usage.output_tokens} "
      f"total={response.usage.input_tokens + response.usage.output_tokens}")
What to track:
This is a print statement at this stage. As the course progresses and we build a full agent framework, this becomes a proper TokenTracker class with history, alerting, and budget enforcement. The underlying data — response.usage.input_tokens — is the same. What changes is how it is structured and acted upon.
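As a rough preview of that direction, here is one possible shape for such a class. It is a sketch under assumptions, not the course's eventual implementation: the field names, the `record` method, and the default limit are all hypothetical, and `SimpleNamespace` objects stand in for real `response.usage` values so the example runs without an API call.

```python
from dataclasses import dataclass, field
from types import SimpleNamespace


@dataclass
class TokenTracker:
    """Records per-call usage and flags calls that exceed a context budget."""
    max_context: int = 200_000      # assumed model limit
    budget_fraction: float = 0.80   # assumed budget threshold
    history: list = field(default_factory=list)  # (input, output) per call

    @property
    def budget(self) -> int:
        return int(self.max_context * self.budget_fraction)

    def record(self, usage) -> bool:
        """Store one call's usage; return True if the budget was exceeded."""
        self.history.append((usage.input_tokens, usage.output_tokens))
        return usage.input_tokens > self.budget

    @property
    def cumulative_input(self) -> int:
        return sum(tokens_in for tokens_in, _ in self.history)

    @property
    def cumulative_output(self) -> int:
        return sum(tokens_out for _, tokens_out in self.history)


# Stand-ins for response.usage, using a small limit to trigger the alert.
tracker = TokenTracker(max_context=1_000)
first = tracker.record(SimpleNamespace(input_tokens=500, output_tokens=100))
second = tracker.record(SimpleNamespace(input_tokens=900, output_tokens=50))
print(f"alerts: {first}, {second}")
print(f"cumulative in={tracker.cumulative_input} "
      f"out={tracker.cumulative_output}")
```

In a real agent loop, `tracker.record(response.usage)` would replace the print statement, and a `True` return value would be the signal to apply a context-management strategy.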
The key principle: token consumption is never a mystery. The API always reports exactly what was used. Context size is something you have complete visibility into — and therefore something you can manage.