The previous lecture covered what LLMs do — predict the next token using attention over a finite context window. This lecture addresses the equally important question: how did they learn to do it? The answer turns out to be one of the most practically useful things an agent developer can understand, because the training process directly explains the specific, predictable ways LLMs behave. Once you understand how LLMs were trained, you can anticipate their strengths, weaknesses, and quirks — and design agents that account for them.
Commercial LLMs are built through a three-stage process: pre-training, instruction tuning, and reinforcement learning from human feedback (RLHF). Each stage shapes the model's behavior in distinct and predictable ways.
Pre-training is the foundation. The model reads an enormous corpus of text — a significant portion of the publicly available internet, plus books, journals, and other digital text — and learns to predict the next token. It performs this prediction billions of times across trillions of tokens. By the end, the model's weights encode the patterns, knowledge, and style of that training data.
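The objective can be made concrete with a toy sketch. This is not how real LLMs are implemented — a bigram count table stands in here for billions of neural-network weights — but the task is the same: absorb patterns from a corpus, then predict the next token from what came before.

```python
from collections import Counter, defaultdict

# Toy stand-in for pre-training: "train" by counting which token follows
# which in a tiny corpus, then "predict" the most frequent continuation.
corpus = "the model predicts the next token and the next token after that".split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1  # training: absorb patterns from the data

def predict_next(token):
    # The prediction is whatever followed this token most often in training.
    return counts[token].most_common(1)[0][0]

print(predict_next("the"))  # → "next"
```

Note what the prediction reflects: not understanding, just the statistics of the training corpus — which is the point of this lecture in miniature.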
The key insight is straightforward: the model's behavior is a reflection of its training data. Not of intelligence. Not of understanding. Of patterns in the data it was trained on. Once you internalize this, many LLM behaviors that seem mysterious start making perfect sense.
People write differently on the internet than they speak in conversation. Blog posts, articles, and social media are written to grab attention — sensational, emphatic, emotionally charged. The internet rewards attention-grabbing language. When an LLM generates text, it draws on these patterns. The slightly breathless tone, the overuse of words like "crucial," "revolutionary," and "game-changing" — that is the internet's voice, absorbed during pre-training.
The LLM is not excited. The training data was written by people who were trying to get clicks.
LLMs are remarkably capable at writing code, and the reason is that the training data for code is exceptionally well-structured. GitHub and other open-source repositories contain millions of public codebases — working code, written well enough that someone published it.
But it goes deeper than the code itself. GitHub also contains issues, pull requests, commit messages, and code reviews. An issue describes a problem in natural language. A commit or pull request provides the code that fixes it. This amounts to millions of examples of a human-written prompt paired with the code that solves it — which is exactly the task we ask LLMs to perform.
Code also has an inherent advantage as context for language models. Programming languages are structured and precise — a small number of tokens expressing intent with high signal density. This aligns well with how attention works: fewer tokens with higher signal produce better predictions.
People do not just write code — they write about code. Blog posts, tutorials, conference talks, and social media discussions about programming tend to showcase interesting solutions. Nobody writes a blog post about a simple for loop. They write about the clever library, the elegant design pattern, the cutting-edge framework.
The result is that the training data contains disproportionately many sophisticated, complex solutions and relatively fewer examples of the boring, simple approach that would have been fine. When an LLM writes a function, it may reach for a library or pattern that is impressive but unnecessary. It is not showing off. It is reflecting the bias in what gets discussed online.
A raw pre-trained model has absorbed vast knowledge but is not useful as an assistant. If you type a question, it might generate another question — because on the internet, questions are often followed by more questions, not answers.
Instruction tuning is the second stage. Humans write thousands of example conversations: a question or instruction paired with a helpful, well-structured response. The model is fine-tuned on these examples to learn the pattern of receiving a request and producing a useful answer.
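The shape of that data is worth seeing. The format below is illustrative — not any vendor's actual schema — but it captures the structure: each example pairs a human-written request with the helpful response the model should learn to produce.

```python
# Hypothetical instruction-tuning examples (illustrative format only):
# a request paired with the response the model is trained to imitate.
examples = [
    {
        "instruction": "Explain what a context window is in one sentence.",
        "response": "A context window is the maximum number of tokens "
                    "the model can attend to when predicting the next token.",
    },
    {
        "instruction": "Write a Python function that reverses a string.",
        "response": "def reverse(s):\n    return s[::-1]",
    },
]

# Fine-tuning is still next-token prediction, just restricted to these
# curated request/response pairs instead of raw internet text.
for ex in examples:
    assert ex["instruction"] and ex["response"]
```

The mechanism does not change between stages — it is next-token prediction throughout. What changes is the data, which is why the character of this data shapes the model's defaults.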
This stage teaches the model to recognize a request and respond with a direct, helpful, well-structured answer rather than simply continuing the text.
The training examples used for instruction tuning tend to be thorough, well-structured, and comprehensive. The humans writing them were rewarded for being helpful. This creates two predictable side effects.
Verbosity. The model learns that a good response is a long response. Ask a simple yes-or-no question and you get a paragraph — because in the instruction tuning data, that is what "helpful" looked like.
Agreeableness. The instruction tuning data models cooperative, helpful behavior. The model rarely saw examples of pushing back, saying "no," or telling the user they were wrong. So it develops a tendency to go along with whatever the user says, even when the user is mistaken.
For agent developers, this matters because you can counteract these tendencies through system prompts. Instructions like "be concise" or "push back if the user's approach has problems" add a layer on top of instruction tuning — but you can only write effective countermeasures if you understand the defaults you are overriding.
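As a sketch, such a countermeasure might look like the following. The prompt wording and message format are hypothetical — the point is that each instruction targets a specific default inherited from instruction tuning.

```python
# A hypothetical system prompt that overrides two instruction-tuning
# defaults: verbosity and agreeableness.
SYSTEM_PROMPT = (
    "You are a code-review assistant.\n"
    "- Be concise: answer in at most three sentences unless asked for more.\n"
    "- Push back: if the user's approach has problems, say so directly "
    "instead of agreeing."
)

# Typical chat-style message structure: the system prompt sits above
# the conversation and shapes every response that follows.
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "I'll store passwords in plain text, fine, right?"},
]
```

A prompt like this does not retrain the model; it layers an instruction on top of the tuned defaults, which is exactly why knowing those defaults matters.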
The third stage is RLHF. Humans are shown pairs of model responses to the same prompt and asked which one is better. The model is then trained to produce more responses like the ones humans preferred.
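One common way those pairwise judgments are turned into a training signal is a reward model with a Bradley-Terry-style loss — a simplified sketch, leaving out the substantial machinery real RLHF systems add around it:

```python
import math

# Simplified pairwise preference loss for training a reward model.
# r_preferred and r_rejected are scalar scores the reward model assigns
# to the response humans preferred and the one they rejected.
def preference_loss(r_preferred, r_rejected):
    # -log(sigmoid(r_preferred - r_rejected)): small when the preferred
    # response is scored higher, large when the ranking is reversed.
    return -math.log(1.0 / (1.0 + math.exp(-(r_preferred - r_rejected))))

# Scoring the preferred answer higher yields a smaller loss than
# scoring it lower, so training pushes scores toward human rankings.
assert preference_loss(2.0, 0.0) < preference_loss(0.0, 2.0)
```

The model being fine-tuned is then rewarded for producing responses this reward model scores highly — which is how human preference biases, including the ones below, get baked in.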
This is where the model learns style and judgment. It already knows how to follow instructions from stage 2. RLHF teaches it to do so in the way that humans find most satisfying. But human preferences introduce their own biases, and these produce specific behavioral patterns that agent developers encounter constantly.
Human evaluators tend to prefer responses that agree with them. They rate agreeable answers higher than ones that push back. The model learns that agreement is rewarded. This creates the sycophancy problem — the model tells you what you want to hear, even if you are wrong.
For agents, this is particularly dangerous. If an agent is supposed to review code and flag problems, but it has been trained to be agreeable, it may praise bad code instead of criticizing it. Agent developers need to work against this in system prompts by explicitly instructing the model to provide honest assessments.
Human evaluators penalize responses that turn out to be wrong. The model learns to hedge — "it depends," "there are several approaches," "this may vary." This is often appropriate, but it can make agents indecisive when you need them to commit to a specific action or recommendation.
Sometimes RLHF produces the opposite effect from hedging. A well-structured, confident response that happens to be factually wrong can rate higher with human evaluators than a hesitant response that is correct — because humans tend to conflate confidence with accuracy. The model learns this pattern too.
The consequence: when an LLM hallucinates, it does so with conviction. It is not deliberately lying. It is making a bad prediction (generating a token that does not reflect reality), but it has also learned that confident delivery is preferred. The combination of a wrong prediction and a confident tone produces hallucinations that are factually incorrect yet stated with authority.
The practical takeaway from understanding the training pipeline is a framework for reasoning about LLM behavior:
When an LLM does something surprising — good or bad — ask: what was in the training data that would produce this behavior?
This framework applies broadly: verbosity traces back to the instruction tuning examples, sycophancy to human preference ratings, and over-engineered code to the bias in what gets written about online.
Understanding the training pipeline is not trivia — it is a practical engineering skill that applies every time you write a system prompt, design a tool, or decide how much to trust your agent's output.
As an agent developer, this mental model lets you write system prompts that counteract known defaults, anticipate failure modes like sycophancy and confident hallucination, and calibrate how much to trust your agent's output.
This lecture covered how LLMs are trained and why the training process explains their behavior. The next lecture examines what bridges the gap between a language model and an agent: tool calling, system prompts, and the agent loop that turns next-token prediction into real-world action.