Module 5, Lecture 5.1 | Section 3: Prompt and Context Engineering
Modules 2 through 4 built a mental model of how LLMs work: they are next-token predictors, trained on internet text, shaped by reinforcement learning from human feedback. This lecture puts that model to work. The question is no longer what LLMs are, but how to ask them things effectively.
Prompting is not a collection of magic phrases. It is an applied discipline grounded in what we know about how models were trained and how they process context. The principles here are directly applicable to agent development, where prompt failures are not annoyances — they are bugs.
The most impactful prompting principle is also the most obvious: vague prompts produce vague results.
An LLM is a completion engine. It continues whatever pattern the prompt establishes. A vague or underspecified prompt establishes an ambiguous pattern, and the model fills the ambiguity with whatever was most statistically common in its training data — which often does not match the intent of the request.
Consider the difference between:
"Summarize this article."
and:
"Write a three-sentence summary of this article, focusing on the main technical claims and their limitations."
The first leaves length, focus, and structure entirely open. The second constrains all three. The model has far less ambiguity to resolve, and the resulting output is far more predictable.
For agents, predictability is a requirement, not a preference. Agents parse responses programmatically. A response format that holds only 80% of the time breaks the parsing code on the remaining 20% of responses. Format constraints are a core engineering tool, not a stylistic choice.
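One way to enforce a format constraint on the consuming side is to validate every response before acting on it. The sketch below assumes the prompt asked for a JSON object with the keys `summary` and `limitations`; those key names and the `parse_summary` helper are hypothetical, chosen only for illustration.

```python
import json

# Hypothetical schema: the prompt asked for exactly these keys.
REQUIRED_KEYS = {"summary", "limitations"}


def parse_summary(raw: str) -> dict:
    """Parse a model response, raising ValueError if it breaks the format."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("response must be a JSON object")
    missing = REQUIRED_KEYS - set(data)
    if missing:
        raise ValueError(f"response is missing keys: {sorted(missing)}")
    return data
```

Failing loudly here turns an off-format response into an error the agent can retry or log, rather than a silent downstream bug.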
"Do X" reliably outperforms "don't do Y."
The training data insight from Lecture 2.2 applies here: the model has seen far more examples of things being done than things being avoided. A negative instruction requires the model to keep a constraint active throughout generation, which is harder to sustain, so negative constraints are violated more often.
| Weaker (negative) | Stronger (positive) |
|---|---|
| "Don't give me a list." | "Write your response as a single paragraph." |
| "Don't be verbose." | "Keep your response under 100 words." |
| "Don't use jargon." | "Use plain language suitable for a non-technical reader." |
The structural reason this works: a positive instruction defines what the output looks like. A negative instruction defines what to avoid, leaving the space of valid outputs large and underspecified. The model has more room to sample something that technically satisfies the constraint but misses the intent.
An occasional "do this, not that" formulation is fine — but only paired with a concrete positive example. The positive is the actual constraint; the negative is just emphasis.
Four constraint types reliably improve compliance across model families:
{...}" / "Use a markdown table."
Constraints work because they make the completion task more deterministic. The model has a smaller valid output space to sample from, and each token generated is more strongly conditioned by the prior context.
Four failure modes appear repeatedly in first-draft prompts and agent system prompts.
Conflicting guidance. "Be concise but thorough" or "be creative but precise" creates irresolvable tension. The model resolves it by splitting the difference — and produces something that is neither. Each instruction should have one clear goal.
Instruction overload. A system prompt with 40 behavioral rules is worse than one with 10. The model's attention is finite. Critical instructions buried in a long list are violated more often than the same instructions standing alone. The goal is a focused, minimal prompt — not the most comprehensive one you can write.
Buried key information. Research on long-context attention shows that LLMs attend most strongly to the beginning and end of the context window. Content in the middle receives systematically less attention. Critical instructions belong at the top of a system prompt, not in paragraph eight of a long document.
Assuming shared context. What is obvious to the developer is not automatically obvious to the model. Decisions made earlier in a conversation attenuate as the conversation grows. If a decision made three turns ago needs to govern the current response, restate it. This is especially important in agents, where sessions grow long and early context degrades.
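Restating earlier decisions can be made mechanical. The sketch below assumes the agent keeps a list of decisions that must stay in force and prepends them to each new user message; the `with_pinned_decisions` helper and the `<standing_decisions>` tag name are hypothetical.

```python
def with_pinned_decisions(user_message: str, decisions: list[str]) -> str:
    """Prepend standing decisions so they sit next to the current request."""
    if not decisions:
        return user_message
    bullet_lines = "\n".join(f"- {decision}" for decision in decisions)
    return (
        "<standing_decisions>\n"
        f"{bullet_lines}\n"
        "</standing_decisions>\n"
        f"{user_message}"
    )
```

For example, `with_pinned_decisions("Add a login endpoint.", ["Use PostgreSQL, not SQLite."])` places the database decision directly above the new request instead of relying on a turn from much earlier in the session.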
Lecture 2.2 covered the behavioral side effects of RLHF (Reinforcement Learning from Human Feedback) — the training phase that shapes how models respond, not just what they know. The biases it introduces are real and systematic. Prompt design is the first line of defense.
| Bias | Counter-instruction |
|---|---|
| Verbosity | "Keep your response under 150 words. Do not add explanation beyond what was asked." |
| Sycophancy | "Do not agree with me if I am wrong. Correct me directly." |
| Over-engineering | "Implement the simplest solution that works. No extra abstractions." |
| Hedging | "State your answer directly. Do not qualify unless the uncertainty is material." |
These counter-instructions work because they explicitly override statistical priors established during training. They are not tricks — they are corrections applied at the prompt level to behavior that was baked in during model development.
Over-engineering is particularly important to address in coding agents. Without an explicit simplicity constraint, models trained on blog posts and open-source repositories tend to produce heavily abstracted code with extensive error handling, type annotations, and design patterns the task does not require.
As prompts grow more complex — containing instructions, context, examples, and user input simultaneously — the model must infer which part of the prompt is which. XML tags make these boundaries explicit.
<instructions>
You are a code reviewer. Identify bugs and style violations.
</instructions>
<code>
def get_user(id):
    return db.query("SELECT * FROM users WHERE id=" + id)
</code>
Without tags, the model must infer the boundary between instruction and content. With tags, it is explicit. In an agent system prompt, where a single message may contain tool documentation, behavioral rules, few-shot examples, and dynamically injected context, explicit delineation is load-bearing — not cosmetic.
The tags do not need to be valid XML. There is no schema requirement. Short, descriptive tag names are sufficient: <instructions>, <context>, <example>, <user_input>. Anthropic's training specifically reinforces XML-tagged structure, but the principle generalizes to any model.
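A small helper keeps tag usage consistent across prompts. This is a minimal sketch; `tag` and `build_prompt` are hypothetical names, and the tags are treated purely as delimiters with no escaping or validation.

```python
def tag(name: str, content: str) -> str:
    """Wrap content in a named delimiter tag (a delimiter, not validated XML)."""
    body = content.strip("\n")  # drop blank edges, keep code indentation
    return f"<{name}>\n{body}\n</{name}>"


def build_prompt(instructions: str, code: str) -> str:
    """Assemble a two-section prompt with explicit boundaries."""
    return "\n".join([tag("instructions", instructions), tag("code", code)])
```

Because every section passes through the same function, an opening tag cannot be emitted without its matching close.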
Headers, bullet lists, and code blocks are not just visual formatting — they signal structure the model was trained to recognize. A system prompt written as organized markdown sections is more reliably followed than the same content written as prose paragraphs.
Headers (#, ##) indicate section boundaries and help the model scope which instructions apply where.
The markdown is semantic. The model has been trained on vast amounts of markdown-formatted content and treats structural markers as signals about intent, not just about visual appearance.
To understand why chain-of-thought (CoT) prompting works, it helps to recall what "autoregressive next-token predictor" means in practice. The model generates one token at a time. Each token it produces is appended to the context, and that extended context becomes the input for generating the next token. The model is, in a sense, reading its own output as it writes.
Without CoT, the model must produce a correct answer directly. All reasoning is implicit — compressed into the final answer tokens in one shot. For a simple question, this is fine. For a multi-step problem, any error in an implicit intermediate step propagates silently to the answer.
When you instruct the model to reason step by step, it first outputs a plan — a sequence of intermediate reasoning steps. Those tokens become part of the context for all subsequent generation. The model is conditioning each answer token not just on the original prompt, but on an explicit, visible reasoning trace. For multi-step problems, this dramatically improves accuracy: errors at step N are in context when generating step N+1, rather than hidden.
This is why CoT helps. It does not make the model more capable — it gives the model its own reasoning as context, so each step can build correctly on the last.
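In an agent, the reasoning trace and the final answer usually need to be separated again after generation. The sketch below assumes the prompt asks for reasoning inside `<reasoning>` tags and the result inside `<answer>` tags; the tag names, the instruction wording, and both helper functions are illustrative assumptions, and no model is called here.

```python
import re
from typing import Optional

# Hypothetical instruction text appended to a task to elicit step-by-step reasoning.
COT_INSTRUCTION = (
    "Reason step by step inside <reasoning> tags, "
    "then give only the final answer inside <answer> tags."
)


def with_cot(task: str) -> str:
    """Append the chain-of-thought instruction to a task description."""
    return f"{task}\n\n{COT_INSTRUCTION}"


def extract_answer(response: str) -> Optional[str]:
    """Return the text inside the first <answer> block, or None if absent."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return match.group(1).strip() if match else None
```

Returning `None` when the `<answer>` block is missing lets the caller treat an off-format response as a failure instead of parsing reasoning text as the answer.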
Use CoT when:
Skip CoT when:
CoT adds real tokens. In an agent with 50 tool calls per session, 50 tokens of reasoning per call adds 2,500 tokens to context per session. At 10 sessions per day, that is 25,000 tokens daily — 750,000 per month. The benefit is real; so is the cost. Use CoT for tasks where intermediate reasoning visibly reduces errors, and measure before committing.
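The arithmetic above is easy to parameterize for your own workload. This sketch reproduces the figures from the text; the function name is hypothetical, and the 30-day month is an assumption consistent with the 750,000 figure.

```python
def cot_token_overhead(
    calls_per_session: int,
    reasoning_tokens_per_call: int,
    sessions_per_day: int,
    days_per_month: int = 30,  # assumption: a 30-day month
) -> dict:
    """Estimate the extra tokens that CoT reasoning adds over time."""
    per_session = calls_per_session * reasoning_tokens_per_call
    per_day = per_session * sessions_per_day
    per_month = per_day * days_per_month
    return {"per_session": per_session, "per_day": per_day, "per_month": per_month}


overhead = cot_token_overhead(
    calls_per_session=50, reasoning_tokens_per_call=50, sessions_per_day=10
)
print(overhead)
# {'per_session': 2500, 'per_day': 25000, 'per_month': 750000}
```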
Claude's extended thinking mode applies the same mechanism: the model generates intermediate reasoning tokens before the final answer, but those tokens are emitted in a separate thinking block rather than inline in the response. The principle is identical; the difference is where the reasoning appears in the output.
A prompt template is a reusable prompt structure with placeholders for variable content:
REVIEW_TEMPLATE = """
<instructions>
Review the following code for bugs, style violations, and security issues.
Return a JSON array of findings with keys: type, severity, line, description.
</instructions>
<code>
{code}
</code>
"""
Templates enforce consistency across agent runs. They make prompts reviewable: a template defined once in a .py file is diffable, testable, and version-controllable in ways that a prompt string assembled inline at each call site is not.
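"Testable" can be taken literally. The sketch below renders the review template with `str.format` and checks the result the way any other function would be tested; `render_review_prompt` and the test function are hypothetical names, and the template is repeated so the example is self-contained.

```python
REVIEW_TEMPLATE = """
<instructions>
Review the following code for bugs, style violations, and security issues.
Return a JSON array of findings with keys: type, severity, line, description.
</instructions>
<code>
{code}
</code>
"""


def render_review_prompt(code: str) -> str:
    """Fill the template's {code} placeholder with the code under review."""
    return REVIEW_TEMPLATE.format(code=code)


def test_render_review_prompt() -> None:
    # Braces inside the substituted value are safe: format() does not rescan it.
    prompt = render_review_prompt("x = {'a': 1}")
    assert "x = {'a': 1}" in prompt
    assert "{code}" not in prompt
    assert prompt.index("<instructions>") < prompt.index("<code>")


test_render_review_prompt()
```

One caveat with `str.format`: literal braces in the template itself, such as an inline JSON example, must be escaped as `{{` and `}}`, or formatting will fail.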
Treat prompts as code. Store them in version control. When a prompt change improves results, commit it with a descriptive message — the same way you would a code change. When a prompt change causes regressions, revert it. The prompts powering an agent are arguably as important as the agent code itself: a model that behaves incorrectly because of a bad prompt produces incorrect outputs without raising an exception.
Do XML tags need to be valid XML? No. The tags are delimiters, not a schema. Short, descriptive labels in angle brackets are sufficient. The model does not validate the XML — it uses the tag names as structural signals.
How long should a system prompt be? Every token in the system prompt is re-sent on every API call. A 2,000-token system prompt adds 2,000 input tokens per turn. The goal is the minimum system prompt that produces the behavior you need — not the most thorough one you can write. Lecture 5.2 covers system prompt token budgeting directly.
Does chain-of-thought make the model smarter? No. CoT improves performance on tasks that benefit from explicit intermediate reasoning, but it does not increase the model's underlying capability. On simple tasks, CoT adds noise, not signal. The improvement is largest on math and multi-step logical reasoning, where intermediate steps catch compounding errors.
Are XML tags specific to Claude? Anthropic's training specifically reinforces XML-tagged structure, so the technique is particularly effective with Claude. The underlying principle — explicit structure reduces parsing ambiguity — generalizes to other models, though specific techniques may need tuning.