This lecture builds the mental model you need as an agent developer. You do not need to become a deep learning researcher to build effective agents, but you do need to understand how LLMs process input, generate output, and why certain design decisions matter when assembling context. The goal is practical understanding — enough to reason about LLM behavior and make better engineering decisions.
An LLM is a neural network — a mathematical function with billions of adjustable numbers called parameters or weights. Text goes in (converted to numbers), and a prediction comes out. During training, the weights are adjusted so predictions improve over time. That is the entire concept: data in, prediction out, weights tuned to reduce error.
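The "data in, prediction out, weights tuned to reduce error" loop can be sketched at toy scale. This is a one-parameter model trained by gradient descent, purely illustrative — a real LLM does the same thing with billions of weights and a next-token objective instead of a single multiplication:

```python
# Toy illustration of "data in, prediction out, weights tuned to reduce error".
# A single-weight model y = w * x is fit by gradient descent; the mechanics
# (predict, measure error, nudge weights) are the same idea an LLM uses at
# vastly larger scale.

def train(data, steps=200, lr=0.01):
    w = 0.0  # the model's single adjustable parameter ("weight")
    for _ in range(steps):
        for x, target in data:
            pred = w * x              # data in, prediction out
            error = pred - target     # how wrong was the prediction?
            w -= lr * 2 * error * x   # adjust the weight to reduce the error
    return w

# Data generated by the "true" rule y = 3x; training should recover w close to 3.
data = [(x, 3.0 * x) for x in range(1, 6)]
w = train(data)
print(round(w, 2))
```

After training, everything the model "knows" about the rule y = 3x lives in that one number — the same way an LLM's knowledge lives entirely in its weights.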
Everything the model "knows" about language, code, and reasoning is stored in those weights. There is no database, no lookup table, no separate knowledge store. Billions of numbers, learned from patterns in training data, encode the model's entire understanding of the world.
Neural networks have existed for decades, but what changed with LLMs is scale — billions of parameters trained on enormous datasets. The architecture that made this scale practical is the transformer, introduced in 2017.
Before transformers, language models were sequential. They processed text one word at a time, left to right, passing a compressed summary forward at each step. The problem: by the time the model reached the end of a sentence, earlier words had been compressed through so many steps that distant connections were lost. In a sentence like "The capital of France is ___," a sequential model might lose the connection between "capital" and "France" if they were separated by enough intervening text, potentially confusing "capital" with an unrelated meaning.
The transformer's key innovation is attention — a mechanism that lets the model look at all parts of the input simultaneously and decide which parts are relevant to each other. Instead of reading text sequentially, the transformer processes the entire input at once and builds connections between any pair of tokens, regardless of how far apart they are.
Every major LLM — GPT, Claude, Llama, Gemini — is a transformer. The differences between them are in size, training data, and fine-tuning, not in fundamental architecture. The transformer design itself is not proprietary; the intellectual property lies primarily in the training data and the specific training procedures each organization uses.
LLMs have one fundamental operation: they predict the next token. Given a sequence of tokens, the model produces a probability distribution over what comes next. It selects one token, appends it to the sequence, and repeats — generating text one token at a time, left to right, with each new token conditioned on everything that came before it.
When an LLM produces a paragraph of text, that paragraph was generated through this iterative process. Every output token requires a full pass through the model, which is why output tokens are computationally more expensive than input tokens in API pricing.
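The generation loop itself is simple enough to sketch. Here a hard-coded toy probability table stands in for the model (a real LLM conditions on the entire sequence, not just the last token, and its distribution comes from the weights, not a dictionary):

```python
# Sketch of autoregressive generation: each step consults the model once,
# picks a token, appends it, and repeats. `bigram_probs` is a made-up
# stand-in for a real model's learned next-token distribution.

bigram_probs = {
    "The":     {"capital": 0.9, "cat": 0.1},
    "capital": {"of": 1.0},
    "of":      {"France": 1.0},
    "France":  {"is": 1.0},
    "is":      {"Paris": 0.95, "big": 0.05},
    "Paris":   {"<eos>": 1.0},
}

def generate(prompt_tokens, max_new=10):
    tokens = list(prompt_tokens)
    for _ in range(max_new):
        dist = bigram_probs.get(tokens[-1], {})  # one "model pass" per output token
        if not dist:
            break
        next_tok = max(dist, key=dist.get)       # greedy: take the most likely token
        if next_tok == "<eos>":                  # model signals it is done
            break
        tokens.append(next_tok)                  # the next step conditions on this
    return tokens

out = generate(["The"])
print(" ".join(out))  # → The capital of France is Paris
```

The loop structure explains the pricing asymmetry: input tokens are processed in one pass, but every output token costs its own full pass through the model.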
Tokens are not words. They are sub-word pieces — fragments that the model has learned are useful building blocks for representing language. Common words like "the" or "and" map to single tokens. Less common words get split into pieces: "tokenization" might become "token" + "ization." Code tokenizes differently from prose — variable names, brackets, and operators each consume their own tokens.
A rough rule of thumb: one token is approximately three-quarters of a word in English, though this varies with writing style and content type.
For agent developers, tokens matter in two ways. First, they are the unit of cost — API providers charge per token for both input and output. Second, they fill up the context window, and that window is finite. Every tool result, every conversation message, every system prompt instruction consumes tokens. Tokens are the budget you manage throughout every agent interaction.
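The three-quarters rule of thumb is enough for back-of-envelope budgeting. A minimal sketch — the prices below are placeholders, not any provider's actual rates, and a real tokenizer should replace the word-count estimate when precision matters:

```python
# Back-of-envelope token and cost accounting using the ~0.75 words-per-token
# rule of thumb. Placeholder prices; check your provider's actual rates.

def estimate_tokens(text: str) -> int:
    words = len(text.split())
    return max(1, round(words / 0.75))  # roughly 4/3 tokens per English word

def estimate_cost(input_text: str, expected_output_tokens: int,
                  price_in_per_1k: float = 0.003,    # placeholder $ per 1K input tokens
                  price_out_per_1k: float = 0.015):  # placeholder $ per 1K output tokens
    n_in = estimate_tokens(input_text)
    cost = (n_in * price_in_per_1k
            + expected_output_tokens * price_out_per_1k) / 1000
    return n_in, cost

n_in, cost = estimate_cost("The capital of France is Paris",
                           expected_output_tokens=100)
print(n_in, round(cost, 5))
```

Note how the (placeholder) output rate dominates: output tokens typically cost several times more than input tokens, for the reason covered above — each one is a full model pass.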
If the model is just predicting the next token, how does it produce coherent paragraphs, follow instructions, or connect information from earlier in a long conversation? The answer is attention.
When deciding what token comes next, the model does not weigh all previous tokens equally. It attends to the ones most relevant to the current prediction. Consider the sequence "The capital of France is ___." When predicting what fills that blank, the model attends heavily to "capital" and "France" — those tokens carry the signal. It attends less to "The" and "of" — those are structural but not informative for this particular prediction.
This happens across every token in the context. Every token can attend to every other token, which is what gives transformers their power over sequential architectures.
Attention works through three learned components for each token: a Query (what this token is looking for), a Key (what this token advertises about itself), and a Value (the information this token contributes when another token attends to it).
The model compares each token's Query against every other token's Key to compute relevance scores, then pulls a weighted combination of the matching Values. In the "capital of France" example, the token "is" generates a Query seeking a factual answer. The Keys for "capital" (geography) and "France" (specific place) score high, while "The" and "of" score low. The blended Values from the high-scoring tokens shape the probability distribution, which heavily favors "Paris" — an association the model learned from training data.
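The mechanics fit in a few lines of pure Python. The 2-d vectors below are hand-made to mimic the example — real models learn the Q/K/V projections from data — but the computation is the standard one: score each Key against the Query, scale, softmax into weights, and blend the Values:

```python
# Scaled dot-product attention on tiny hand-made vectors. The numbers are
# invented for illustration; only the computation (Q·K scores -> softmax ->
# weighted sum of V) matches what a real attention head does.

import math

def softmax(xs):
    m = max(xs)                                  # subtract max for stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(query, keys, values):
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]                   # how well each Key matches the Query
    weights = softmax(scores)                    # relevance scores -> probabilities
    out = [sum(w * v[i] for w, v in zip(weights, values))
           for i in range(len(values[0]))]       # weighted blend of Values
    return out, weights

# Toy vectors for the tokens ["The", "capital", "France"]. "capital" and
# "France" are built to score high against the Query from "is".
query  = [1.0, 1.0]
keys   = [[0.1, 0.0], [2.0, 1.0], [1.0, 2.0]]
values = [[0.0, 0.1], [5.0, 0.0], [0.0, 5.0]]

out, weights = attend(query, keys, values)
```

Running this, the structural token "The" receives a small attention weight while "capital" and "France" dominate the blend — the toy analogue of the model pulling signal from the informative tokens.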
A single Query/Key/Value pass captures one type of relationship. But language has many types of relationships occurring simultaneously — factual associations, grammatical structure, word order, stylistic patterns, and more.
An attention head is a single Q/K/V pass with its own learned weights. A frontier LLM typically has 64 to 128 attention heads per layer. The model designer chooses how many heads to include (a computation budget decision), but what each head learns to specialize in emerges from training. Researchers have found that different heads reliably learn different functions — some track syntax, others track coreference, others track factual associations — but these specializations are statistical patterns, not explicitly programmed categories.
These attention heads are organized into layers, and a frontier LLM stacks 80 to 120 layers. Each layer's output feeds into the next, so later layers operate on representations that earlier layers have already refined.
With 80+ layers of 100+ heads each, the model runs thousands of Q/K/V passes over the input. Each head within a layer can run in parallel, which is why LLMs run on GPUs — hardware optimized for massive parallel computation.
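The arithmetic is worth making concrete. The figures below use the ranges from the text; the model width is a hypothetical value, not any specific model's published configuration:

```python
# Rough head/layer arithmetic using the ranges quoted above. Illustrative
# values, not a real model's published configuration.

layers = 96           # within the "80 to 120 layers" range
heads_per_layer = 96  # within the "64 to 128 heads per layer" range
d_model = 12288       # hypothetical model width

qkv_passes = layers * heads_per_layer  # Q/K/V passes per forward step
d_head = d_model // heads_per_layer    # each head works in a smaller subspace

print(qkv_passes, d_head)  # → 9216 128
```

Nearly ten thousand independent Q/K/V passes per token position — and since the heads within a layer do not depend on each other, they can all run at once, which is exactly the workload GPUs are built for.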
Attention is a finite resource. The model has a fixed capacity for how much it can attend to, determined by its architecture and training. As context grows longer, the model's ability to focus on the right information degrades.
Three practical consequences:
Every decision an agent developer makes — what to include in context, what to leave out, when to summarize, when to truncate — is fundamentally about managing this attention budget.
When an agent makes an API call to an LLM, it sends a context window. That window contains everything the model has to work with: the system prompt, the conversation history, tool results, and the current user message.
The model sees all of this, reasons over all of this, and predicts its response based on all of this. Nothing else. There is no memory outside this window, no hidden state, no background knowledge beyond what is encoded in the model weights.
This is a critical concept that trips up developers coming from interactive tools like ChatGPT. An LLM has exactly one operation: take text in, produce text out. Each API call is independent. The model has no memory of previous calls — no session, no stored state, no continuity between requests.
If a conversation feels continuous, that is because the application — the agent code you write — re-sends the entire conversation history as input on every call. ChatGPT the product stores your conversation and re-submits it each time. The underlying model has no idea it has spoken to you before. The application manages state; the model does not.
This is the agent engineer's job: assemble the right context — system prompt, conversation history, tool results, user message — and pass all of it on every single call. The model will not do this for you. It cannot. It is a stateless function.
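A minimal sketch of that job, assuming a hypothetical `call_llm` function standing in for a real chat-completion API (here it just reports how much context it received, to make the statelessness visible):

```python
# Why conversation feels continuous: the application re-sends the whole
# history on every call. `call_llm` is a hypothetical stand-in for a real
# stateless model API — it sees only what is in `messages`, nothing else.

def call_llm(messages):
    return f"(reply based on {len(messages)} messages of context)"

history = [{"role": "system", "content": "You are a helpful agent."}]

def chat_turn(user_text):
    history.append({"role": "user", "content": user_text})
    reply = call_llm(history)   # the ENTIRE history goes in, every single call
    history.append({"role": "assistant", "content": reply})
    return reply

first = chat_turn("Hi")                # the model sees 2 messages
second = chat_turn("What did I say?")  # the model sees 4 messages — the only
                                       # reason it can "remember" the first turn
```

Drop the `history.append` calls and the illusion of memory disappears: each call would see only the latest message, because the model itself stores nothing between requests.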
Some APIs offer prompt caching, which reuses computation for repeated prefixes across calls to reduce cost and latency. This is an infrastructure optimization. It does not change the fundamental model — every call is still a complete, independent, stateless function. Caching saves money; it does not introduce memory.
Context engineering is at the heart of agent development. The agent developer curates the information the LLM reasons over, and the quality of that curation directly determines the quality of the output.
The sweet spot — the smallest possible set of high-signal tokens — is what much of agent engineering is about learning to find. This is the agent developer's primary design space and the central constraint that shapes every architectural decision in agent systems.
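One of the simplest forms of that curation can be sketched directly: keep the system prompt, walk the conversation newest-to-oldest, and drop whatever no longer fits. A toy sketch — real agents often summarize rather than drop, and would use a real tokenizer instead of this words-based estimate:

```python
# Minimal context-budget management: protect the system prompt, keep the
# most recent turns, drop the oldest once the estimated token budget is
# exhausted. Illustrative only; production agents usually summarize instead.

def estimate_tokens(text):
    return max(1, round(len(text.split()) / 0.75))  # rough rule of thumb

def trim_to_budget(system_prompt, turns, budget):
    used = estimate_tokens(system_prompt)  # the system prompt always stays
    kept = []
    for turn in reversed(turns):           # newest-to-oldest: recency survives
        cost = estimate_tokens(turn)
        if used + cost > budget:
            break                          # oldest turns fall off the end
        kept.append(turn)
        used += cost
    return list(reversed(kept))            # restore chronological order

turns = ["old tool result " * 50, "recent question", "latest tool result"]
kept = trim_to_budget("You are a helpful agent.", turns, budget=60)
```

Even this crude policy encodes a real design decision — recent turns carry more signal than stale tool output — which is the kind of judgment context engineering is made of.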
This lecture covered what LLMs do: they predict the next token, using attention to focus on relevant parts of a finite context window. The next lecture examines how commercial LLMs are trained — and why their training data explains many of the behaviors you have likely already noticed when working with them.