How LLMs Are Trained

Module 2, Lecture 2.2 | Introduction to Agentic Systems

This lecture explains the three-stage training pipeline behind commercial LLMs — pre-training, instruction tuning, and RLHF — and shows how each stage produces specific, predictable behaviors. Pre-training on internet text explains dramatic prose and over-engineered code. Instruction tuning introduces verbosity and agreeableness. RLHF creates sycophancy, hedging, and confident errors. The lecture introduces a "training data detective" framework: when an LLM does something surprising, ask what in the training data would produce that behavior. This is a practical engineering skill for predicting LLM strengths and weaknesses, designing effective context, and debugging agent behavior.
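The three stages can be pictured as a sequence of transformations, each layering new tendencies onto the model produced by the previous one. The sketch below is purely illustrative — the function names and the dictionary "model" are placeholders for the lecture's narrative, not real training code or any actual API:

```python
# Toy sketch of the three-stage pipeline described above.
# Each "stage" simply records the behaviors the lecture attributes to it.

def pretrain(model):
    # Stage 1: next-token prediction on internet text.
    model["behaviors"] += ["dramatic prose", "over-engineered code"]
    return model

def instruction_tune(model):
    # Stage 2: supervised fine-tuning on instruction/response pairs.
    model["behaviors"] += ["verbosity", "agreeableness"]
    return model

def rlhf(model):
    # Stage 3: reinforcement learning from human feedback.
    model["behaviors"] += ["sycophancy", "hedging", "confident errors"]
    return model

def train():
    # The stages run in order; later stages build on earlier ones.
    return rlhf(instruction_tune(pretrain({"behaviors": []})))

llm = train()
print(llm["behaviors"])
```

The point of the composition is the lecture's thesis in miniature: a deployed model's quirks are the cumulative residue of all three stages, so debugging behavior means asking which stage's data would have produced it.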

Read the full lecture narrative

Additional Resources