The previous lecture covered what LLMs do — predict the next token using attention over a finite context window. This lecture addresses the equally important question: how did they learn to do it? The answer turns out to be one of the most practically useful things an agent developer can understand, because the training process directly explains the specific, predictable ways LLMs behave. Once you understand how LLMs were trained, you can anticipate their strengths, weaknesses, and quirks — and design agents that account for them.
Commercial LLMs are built through a three-stage process: pre-training, instruction tuning, and reinforcement learning from human feedback (RLHF). Each stage shapes the model's behavior in distinct and predictable ways.
Pre-training is the foundation. The model reads an enormous corpus of text — a significant portion of the publicly available internet, plus books, journals, and other digital text — and learns to predict the next token. It performs this prediction billions of times across trillions of tokens. By the end, the model's weights encode the patterns, knowledge, and style of that training data.
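The objective can be made concrete with a toy sketch. This is not how real LLMs are implemented — a bigram count table stands in here for billions of neural-network weights — but the task is the same: absorb patterns from a corpus, then predict the next token from what came before.

```python
from collections import Counter, defaultdict

# Toy stand-in for pre-training: "train" by counting which token follows
# which in a tiny corpus, then "predict" the most frequent continuation.
corpus = "the model predicts the next token and the next token after that".split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1  # training: absorb patterns from the data

def predict_next(token):
    # The prediction is whatever followed this token most often in training.
    return counts[token].most_common(1)[0][0]

print(predict_next("the"))  # → "next"
```

Note what the prediction reflects: not understanding, just the statistics of the training corpus — which is the point of this lecture in miniature.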
The key insight is straightforward: the model's behavior is a reflection of its training data. Not of intelligence. Not of understanding. Of patterns in the data it was trained on. Once you internalize this, many LLM behaviors that seem mysterious start making perfect sense.
People write differently on the internet than they speak in conversation. Blog posts, articles, and social media are written to grab attention — sensational, emphatic, emotionally charged. The internet rewards attention-grabbing language. When an LLM generates text, it draws on these patterns. The slightly breathless tone, the overuse of words like "crucial," "revolutionary," and "game-changing" — that is the internet's voice, absorbed during pre-training.
The LLM is not excited. The training data was written by people who were trying to get clicks.
LLMs are remarkably capable at writing code, and the reason is that the training data for code is exceptionally well-structured. GitHub and other open-source repositories contain millions of public codebases — working code, written well enough that someone published it.
But it goes deeper than the code itself. GitHub also contains issues, pull requests, commit messages, and code reviews. An issue describes a problem in natural language. A commit or pull request provides the code that fixes it. This amounts to millions of examples of a human-written prompt paired with the code that solves it — which is exactly the task we ask LLMs to perform.
Code also has an inherent advantage as context for language models. Programming languages are structured and precise — a small number of tokens expressing intent with high signal density. This aligns well with how attention works: fewer tokens with higher signal produce better predictions.
People do not just write code — they write about code. Blog posts, tutorials, conference talks, and social media discussions about programming tend to showcase interesting solutions. Nobody writes a blog post about a simple for loop. They write about the clever library, the elegant design pattern, the cutting-edge framework.
The result is that the training data contains disproportionately many sophisticated, complex solutions and relatively fewer examples of the boring, simple approach that would have been fine. When an LLM writes a function, it may reach for a library or pattern that is impressive but unnecessary. It is not showing off. It is reflecting the bias in what gets discussed online.
A raw pre-trained model has absorbed vast knowledge but is not useful as an assistant. If you type a question, it might generate another question — because on the internet, questions are often followed by more questions, not answers.
Instruction tuning is the second stage. Humans write thousands of example conversations: a question or instruction paired with a helpful, well-structured response. The model is fine-tuned on these examples to learn the pattern of receiving a request and producing a useful answer.
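The shape of that data is worth seeing. The format below is illustrative — not any vendor's actual schema — but it captures the structure: each example pairs a human-written request with the helpful response the model should learn to produce.

```python
# Hypothetical instruction-tuning examples (illustrative format only):
# a request paired with the response the model is trained to imitate.
examples = [
    {
        "instruction": "Explain what a context window is in one sentence.",
        "response": "A context window is the maximum number of tokens "
                    "the model can attend to when predicting the next token.",
    },
    {
        "instruction": "Write a Python function that reverses a string.",
        "response": "def reverse(s):\n    return s[::-1]",
    },
]

# Fine-tuning is still next-token prediction, just restricted to these
# curated request/response pairs instead of raw internet text.
for ex in examples:
    assert ex["instruction"] and ex["response"]
```

The mechanism does not change between stages — it is next-token prediction throughout. What changes is the data, which is why the character of this data shapes the model's defaults.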
This stage teaches the model to recognize a request and respond with a direct, helpful, well-structured answer rather than simply continuing the text.
The training examples used for instruction tuning tend to be thorough, well-structured, and comprehensive. The humans writing them were rewarded for being helpful. This creates two predictable side effects.
Verbosity. The model learns that a good response is a long response. Ask a simple yes-or-no question and you get a paragraph — because in the instruction tuning data, that is what "helpful" looked like.
Agreeableness. The instruction tuning data models cooperative, helpful behavior. The model rarely saw examples of pushing back, saying "no," or telling the user they were wrong. So it develops a tendency to go along with whatever the user says, even when the user is mistaken.
For agent developers, this matters because you can counteract these tendencies through system prompts. Instructions like "be concise" or "push back if the user's approach has problems" add a layer on top of instruction tuning — but you can only write effective countermeasures if you understand the defaults you are overriding.
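As a sketch, such a countermeasure might look like the following. The prompt wording and message format are hypothetical — the point is that each instruction targets a specific default inherited from instruction tuning.

```python
# A hypothetical system prompt that overrides two instruction-tuning
# defaults: verbosity and agreeableness.
SYSTEM_PROMPT = (
    "You are a code-review assistant.\n"
    "- Be concise: answer in at most three sentences unless asked for more.\n"
    "- Push back: if the user's approach has problems, say so directly "
    "instead of agreeing."
)

# Typical chat-style message structure: the system prompt sits above
# the conversation and shapes every response that follows.
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "I'll store passwords in plain text, fine, right?"},
]
```

A prompt like this does not retrain the model; it layers an instruction on top of the tuned defaults, which is exactly why knowing those defaults matters.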
The third stage is RLHF. Humans are shown pairs of model responses to the same prompt and asked which one is better. The model is then trained to produce more responses like the ones humans preferred.
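One common way those pairwise judgments are turned into a training signal is a reward model with a Bradley-Terry-style loss — a simplified sketch, leaving out the substantial machinery real RLHF systems add around it:

```python
import math

# Simplified pairwise preference loss for training a reward model.
# r_preferred and r_rejected are scalar scores the reward model assigns
# to the response humans preferred and the one they rejected.
def preference_loss(r_preferred, r_rejected):
    # -log(sigmoid(r_preferred - r_rejected)): small when the preferred
    # response is scored higher, large when the ranking is reversed.
    return -math.log(1.0 / (1.0 + math.exp(-(r_preferred - r_rejected))))

# Scoring the preferred answer higher yields a smaller loss than
# scoring it lower, so training pushes scores toward human rankings.
assert preference_loss(2.0, 0.0) < preference_loss(0.0, 2.0)
```

The model being fine-tuned is then rewarded for producing responses this reward model scores highly — which is how human preference biases, including the ones below, get baked in.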
This is where the model learns style and judgment. It already knows how to follow instructions from stage 2. RLHF teaches it to do so in the way that humans find most satisfying. But human preferences introduce their own biases, and these produce specific behavioral patterns that agent developers encounter constantly.
Human evaluators tend to prefer responses that agree with them. They rate agreeable answers higher than ones that push back. The model learns that agreement is rewarded. This creates the sycophancy problem — the model tells you what you want to hear, even if you are wrong.
For agents, this is particularly dangerous. If an agent is supposed to review code and flag problems, but it has been trained to be agreeable, it may praise bad code instead of criticizing it. Agent developers need to work against this in system prompts by explicitly instructing the model to provide honest assessments.
Human evaluators penalize responses that turn out to be wrong. The model learns to hedge — "it depends," "there are several approaches," "this may vary." This is often appropriate, but it can make agents indecisive when you need them to commit to a specific action or recommendation.
Sometimes RLHF produces the opposite effect from hedging. A well-structured, confident response that happens to be factually wrong can rate higher with human evaluators than a hesitant response that is correct — because humans tend to conflate confidence with accuracy. The model learns this pattern too.
The consequence: when an LLM hallucinates, it does so with conviction. It is not deliberately lying. It is making a bad prediction (generating a token that does not reflect reality), but it has also learned that confident delivery is preferred. The combination of a wrong prediction and a confident tone produces hallucinations that are factually incorrect yet stated with authority.
The practical takeaway from understanding the training pipeline is a framework for reasoning about LLM behavior:
When an LLM does something surprising — good or bad — ask: what was in the training data that would produce this behavior?
This framework applies broadly: verbosity traces back to the instruction tuning examples, sycophancy to human preference ratings, and over-engineered code to the bias in what gets written about online.
Understanding the training pipeline is not trivia — it is a practical engineering skill that applies every time you write a system prompt, design a tool, or decide how much to trust your agent's output.
As an agent developer, this mental model lets you write system prompts that counteract known defaults, anticipate failure modes like sycophancy and confident hallucination, and calibrate how much to trust your agent's output.
This lecture covered how LLMs are trained and why the training process explains their behavior. The next lecture examines what bridges the gap between a language model and an agent: tool calling, system prompts, and the agent loop that turns next-token prediction into real-world action.