This lecture builds the mental model you need as an agent developer. You do not need to become a deep learning researcher to build effective agents, but you do need to understand how LLMs process input, generate output, and why certain design decisions matter when assembling context. The goal is practical understanding — enough to reason about LLM behavior and make better engineering decisions.
An LLM is a neural network — a mathematical function with billions of adjustable numbers called parameters or weights. Text goes in (converted to numbers), and a prediction comes out. During training, the weights are adjusted so predictions improve over time. That is the entire concept: data in, prediction out, weights tuned to reduce error.
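The "data in, prediction out, weights tuned to reduce error" loop can be sketched at toy scale. This is a one-parameter model trained by gradient descent, purely illustrative — a real LLM does the same thing with billions of weights and a next-token objective instead of a single multiplication:

```python
# Toy illustration of "data in, prediction out, weights tuned to reduce error".
# A single-weight model y = w * x is fit by gradient descent; the mechanics
# (predict, measure error, nudge weights) are the same idea an LLM uses at
# vastly larger scale.

def train(data, steps=200, lr=0.01):
    w = 0.0  # the model's single adjustable parameter ("weight")
    for _ in range(steps):
        for x, target in data:
            pred = w * x              # data in, prediction out
            error = pred - target     # how wrong was the prediction?
            w -= lr * 2 * error * x   # adjust the weight to reduce the error
    return w

# Data generated by the "true" rule y = 3x; training should recover w close to 3.
data = [(x, 3.0 * x) for x in range(1, 6)]
w = train(data)
print(round(w, 2))
```

After training, everything the model "knows" about the rule y = 3x lives in that one number — the same way an LLM's knowledge lives entirely in its weights.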
Everything the model "knows" about language, code, and reasoning is stored in those weights. There is no database, no lookup table, no separate knowledge store. Billions of numbers, learned from patterns in training data, encode the model's entire understanding of the world.
Neural networks have existed for decades, but what changed with LLMs is scale — billions of parameters trained on enormous datasets. The architecture that made this scale practical is the transformer, introduced in 2017.
Before transformers, language models were sequential. They processed text one word at a time, left to right, passing a compressed summary forward at each step. The problem: by the time the model reached the end of a sentence, earlier words had been compressed through so many steps that distant connections were lost. In a sentence like "The capital of France is ___," a sequential model might lose the connection between "capital" and "France" if they were separated by enough intervening text, potentially confusing "capital" with an unrelated meaning.
The transformer's key innovation is attention — a mechanism that lets the model look at all parts of the input simultaneously and decide which parts are relevant to each other. Instead of reading text sequentially, the transformer processes the entire input at once and builds connections between any pair of tokens, regardless of how far apart they are.
Every major LLM — GPT, Claude, Llama, Gemini — is a transformer. The differences between them are in size, training data, and fine-tuning, not in fundamental architecture. The transformer design itself is not proprietary; the intellectual property lies primarily in the training data and the specific training procedures each organization uses.
LLMs have one fundamental operation: they predict the next token. Given a sequence of tokens, the model produces a probability distribution over what comes next. It selects one token, appends it to the sequence, and repeats — generating text one token at a time, left to right, with each new token conditioned on everything that came before it.
When an LLM produces a paragraph of text, that paragraph was generated through this iterative process. Every output token requires a full pass through the model, which is why output tokens are computationally more expensive than input tokens in API pricing.
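The generation loop itself is simple enough to sketch. Here a hard-coded toy probability table stands in for the model (a real LLM conditions on the entire sequence, not just the last token, and its distribution comes from the weights, not a dictionary):

```python
# Sketch of autoregressive generation: each step consults the model once,
# picks a token, appends it, and repeats. `bigram_probs` is a made-up
# stand-in for a real model's learned next-token distribution.

bigram_probs = {
    "The":     {"capital": 0.9, "cat": 0.1},
    "capital": {"of": 1.0},
    "of":      {"France": 1.0},
    "France":  {"is": 1.0},
    "is":      {"Paris": 0.95, "big": 0.05},
    "Paris":   {"<eos>": 1.0},
}

def generate(prompt_tokens, max_new=10):
    tokens = list(prompt_tokens)
    for _ in range(max_new):
        dist = bigram_probs.get(tokens[-1], {})  # one "model pass" per output token
        if not dist:
            break
        next_tok = max(dist, key=dist.get)       # greedy: take the most likely token
        if next_tok == "<eos>":                  # model signals it is done
            break
        tokens.append(next_tok)                  # the next step conditions on this
    return tokens

out = generate(["The"])
print(" ".join(out))  # → The capital of France is Paris
```

The loop structure explains the pricing asymmetry: input tokens are processed in one pass, but every output token costs its own full pass through the model.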
Tokens are not words. They are sub-word pieces — fragments that the model has learned are useful building blocks for representing language. Common words like "the" or "and" map to single tokens. Less common words get split into pieces: "tokenization" might become "token" + "ization." Code tokenizes differently from prose — variable names, brackets, and operators each consume their own tokens.
A rough rule of thumb: one token is approximately three-quarters of a word in English, though this varies with writing style and content type.
For agent developers, tokens matter in two ways. First, they are the unit of cost — API providers charge per token for both input and output. Second, they fill up the context window, and that window is finite. Every tool result, every conversation message, every system prompt instruction consumes tokens. Tokens are the budget you manage throughout every agent interaction.
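The three-quarters rule of thumb is enough for back-of-envelope budgeting. A minimal sketch — the prices below are placeholders, not any provider's actual rates, and a real tokenizer should replace the word-count estimate when precision matters:

```python
# Back-of-envelope token and cost accounting using the ~0.75 words-per-token
# rule of thumb. Placeholder prices; check your provider's actual rates.

def estimate_tokens(text: str) -> int:
    words = len(text.split())
    return max(1, round(words / 0.75))  # roughly 4/3 tokens per English word

def estimate_cost(input_text: str, expected_output_tokens: int,
                  price_in_per_1k: float = 0.003,    # placeholder $ per 1K input tokens
                  price_out_per_1k: float = 0.015):  # placeholder $ per 1K output tokens
    n_in = estimate_tokens(input_text)
    cost = (n_in * price_in_per_1k
            + expected_output_tokens * price_out_per_1k) / 1000
    return n_in, cost

n_in, cost = estimate_cost("The capital of France is Paris",
                           expected_output_tokens=100)
print(n_in, round(cost, 5))
```

Note how the (placeholder) output rate dominates: output tokens typically cost several times more than input tokens, for the reason covered above — each one is a full model pass.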
If the model is just predicting the next token, how does it produce coherent paragraphs, follow instructions, or connect information from earlier in a long conversation? The answer is attention.
When deciding what token comes next, the model does not weigh all previous tokens equally. It attends to the ones most relevant to the current prediction. Consider the sequence "The capital of France is ___." When predicting what fills that blank, the model attends heavily to "capital" and "France" — those tokens carry the signal. It attends less to "The" and "of" — those are structural but not informative for this particular prediction.
This happens across every token in the context. Every token can attend to every other token, which is what gives transformers their power over sequential architectures.
Attention works through three learned components for each token: a Query (what this token is looking for), a Key (what this token advertises about itself), and a Value (the information this token contributes when another token attends to it).
The model compares each token's Query against every other token's Key to compute relevance scores, then pulls a weighted combination of the matching Values. In the "capital of France" example, the token "is" generates a Query seeking a factual answer. The Keys for "capital" (geography) and "France" (specific place) score high, while "The" and "of" score low. The blended Values from the high-scoring tokens shape the probability distribution, which heavily favors "Paris" — an association the model learned from training data.
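The mechanics fit in a few lines of pure Python. The 2-d vectors below are hand-made to mimic the example — real models learn the Q/K/V projections from data — but the computation is the standard one: score each Key against the Query, scale, softmax into weights, and blend the Values:

```python
# Scaled dot-product attention on tiny hand-made vectors. The numbers are
# invented for illustration; only the computation (Q·K scores -> softmax ->
# weighted sum of V) matches what a real attention head does.

import math

def softmax(xs):
    m = max(xs)                                  # subtract max for stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(query, keys, values):
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]                   # how well each Key matches the Query
    weights = softmax(scores)                    # relevance scores -> probabilities
    out = [sum(w * v[i] for w, v in zip(weights, values))
           for i in range(len(values[0]))]       # weighted blend of Values
    return out, weights

# Toy vectors for the tokens ["The", "capital", "France"]. "capital" and
# "France" are built to score high against the Query from "is".
query  = [1.0, 1.0]
keys   = [[0.1, 0.0], [2.0, 1.0], [1.0, 2.0]]
values = [[0.0, 0.1], [5.0, 0.0], [0.0, 5.0]]

out, weights = attend(query, keys, values)
```

Running this, the structural token "The" receives a small attention weight while "capital" and "France" dominate the blend — the toy analogue of the model pulling signal from the informative tokens.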
A single Query/Key/Value pass captures one type of relationship. But language has many types of relationships occurring simultaneously — factual associations, grammatical structure, word order, stylistic patterns, and more.
An attention head is a single Q/K/V pass with its own learned weights. A frontier LLM typically has 64 to 128 attention heads per layer. The model designer chooses how many heads to include (a computation budget decision), but what each head learns to specialize in emerges from training. Researchers have found that different heads reliably learn different functions — some track syntax, others track coreference, others track factual associations — but these specializations are statistical patterns, not explicitly programmed categories.
These attention heads are organized into layers, and a frontier LLM stacks 80 to 120 layers. Each layer's output feeds into the next, so later layers operate on representations that earlier layers have already refined.
With 80+ layers of 100+ heads each, the model runs thousands of Q/K/V passes over the input. Each head within a layer can run in parallel, which is why LLMs run on GPUs — hardware optimized for massive parallel computation.
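The arithmetic is worth making concrete. The figures below use the ranges from the text; the model width is a hypothetical value, not any specific model's published configuration:

```python
# Rough head/layer arithmetic using the ranges quoted above. Illustrative
# values, not a real model's published configuration.

layers = 96           # within the "80 to 120 layers" range
heads_per_layer = 96  # within the "64 to 128 heads per layer" range
d_model = 12288       # hypothetical model width

qkv_passes = layers * heads_per_layer  # Q/K/V passes per forward step
d_head = d_model // heads_per_layer    # each head works in a smaller subspace

print(qkv_passes, d_head)  # → 9216 128
```

Nearly ten thousand independent Q/K/V passes per token position — and since the heads within a layer do not depend on each other, they can all run at once, which is exactly the workload GPUs are built for.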
Attention is a finite resource. The model has a fixed capacity for how much it can attend to, determined by its architecture and training. As context grows longer, the model's ability to focus on the right information degrades.
Three practical consequences:
Every decision an agent developer makes — what to include in context, what to leave out, when to summarize, when to truncate — is fundamentally about managing this attention budget.
When an agent makes an API call to an LLM, it sends a context window. That window contains everything the model has to work with: the system prompt, the conversation history, tool results, and the current user message.
The model sees all of this, reasons over all of this, and predicts its response based on all of this. Nothing else. There is no memory outside this window, no hidden state, no background knowledge beyond what is encoded in the model weights.
This is a critical concept that trips up developers coming from interactive tools like ChatGPT. An LLM has exactly one operation: take text in, produce text out. Each API call is independent. The model has no memory of previous calls — no session, no stored state, no continuity between requests.
If a conversation feels continuous, that is because the application — the agent code you write — re-sends the entire conversation history as input on every call. ChatGPT the product stores your conversation and re-submits it each time. The underlying model has no idea it has spoken to you before. The application manages state; the model does not.
This is the agent engineer's job: assemble the right context — system prompt, conversation history, tool results, user message — and pass all of it on every single call. The model will not do this for you. It cannot. It is a stateless function.
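A minimal sketch of that job, assuming a hypothetical `call_llm` function standing in for a real chat-completion API (here it just reports how much context it received, to make the statelessness visible):

```python
# Why conversation feels continuous: the application re-sends the whole
# history on every call. `call_llm` is a hypothetical stand-in for a real
# stateless model API — it sees only what is in `messages`, nothing else.

def call_llm(messages):
    return f"(reply based on {len(messages)} messages of context)"

history = [{"role": "system", "content": "You are a helpful agent."}]

def chat_turn(user_text):
    history.append({"role": "user", "content": user_text})
    reply = call_llm(history)   # the ENTIRE history goes in, every single call
    history.append({"role": "assistant", "content": reply})
    return reply

first = chat_turn("Hi")                # the model sees 2 messages
second = chat_turn("What did I say?")  # the model sees 4 messages — the only
                                       # reason it can "remember" the first turn
```

Drop the `history.append` calls and the illusion of memory disappears: each call would see only the latest message, because the model itself stores nothing between requests.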
Some APIs offer prompt caching, which reuses computation for repeated prefixes across calls to reduce cost and latency. This is an infrastructure optimization. It does not change the fundamental model — every call is still a complete, independent, stateless function. Caching saves money; it does not introduce memory.
Context engineering is at the heart of agent development. The agent developer curates the information the LLM reasons over, and the quality of that curation directly determines the quality of the output.
The sweet spot — the smallest possible set of high-signal tokens — is what much of agent engineering is about learning to find. This is the agent developer's primary design space and the central constraint that shapes every architectural decision in agent systems.
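One of the simplest forms of that curation can be sketched directly: keep the system prompt, walk the conversation newest-to-oldest, and drop whatever no longer fits. A toy sketch — real agents often summarize rather than drop, and would use a real tokenizer instead of this words-based estimate:

```python
# Minimal context-budget management: protect the system prompt, keep the
# most recent turns, drop the oldest once the estimated token budget is
# exhausted. Illustrative only; production agents usually summarize instead.

def estimate_tokens(text):
    return max(1, round(len(text.split()) / 0.75))  # rough rule of thumb

def trim_to_budget(system_prompt, turns, budget):
    used = estimate_tokens(system_prompt)  # the system prompt always stays
    kept = []
    for turn in reversed(turns):           # newest-to-oldest: recency survives
        cost = estimate_tokens(turn)
        if used + cost > budget:
            break                          # oldest turns fall off the end
        kept.append(turn)
        used += cost
    return list(reversed(kept))            # restore chronological order

turns = ["old tool result " * 50, "recent question", "latest tool result"]
kept = trim_to_budget("You are a helpful agent.", turns, budget=60)
```

Even this crude policy encodes a real design decision — recent turns carry more signal than stale tool output — which is the kind of judgment context engineering is made of.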
This lecture covered what LLMs do: they predict the next token, using attention to focus on relevant parts of a finite context window. The next lecture examines how commercial LLMs are trained — and why their training data explains many of the behaviors you have likely already noticed when working with them.