Module 2 established the conceptual foundation: what LLMs are, how they're trained, and what bridges the gap from language model to agent. This lecture makes the transition from theory to practice. It covers the mechanics of working with an LLM through its API — what you send, what you get back, how conversations are maintained, and what to do when things go wrong.
Before making an API call, a practical question: where does the model run?
Local models. Open-weight models (DeepSeek, Qwen, Mistral, and others) can be downloaded and run on your own hardware using tools like Ollama. This gives you full control — no API keys, no usage fees, no data leaving your machine. The tradeoff is capability: local models are limited by your hardware, and the most capable frontier models are too large to run on consumer machines. Local inference is a good fit for privacy-sensitive work, offline use, and experimentation with fine-tuning.
Provider-hosted models. The dominant approach for building agents today. Providers like Anthropic, OpenAI, and Google host models on their infrastructure. You send requests over HTTPS and pay per token. This gives you access to the most capable models available — Opus, GPT-4, Gemini — with no local hardware requirements. Cost scales with usage rather than upfront investment.
This course uses the Anthropic API in Python, but the choice of provider and language is not fundamental. The concepts — messages, roles, tool calling, streaming — transfer directly to any provider's API. OpenAI, Google, Cohere, and Mistral all expose the same structural patterns. Higher-level libraries like LangChain wrap multiple providers behind a common interface. The principle is: learn one API well, and the patterns apply everywhere.
An LLM API call has a clear structure. Here is a basic call using the Anthropic Python SDK:
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system="You are a helpful coding assistant.",
    messages=[
        {"role": "user", "content": "What does the map() function do in Python?"}
    ]
)
Each parameter serves a specific purpose:
model specifies which LLM to use. Different models have different capabilities, context window sizes, speeds, and costs. The model identifier (here, claude-sonnet-4-6) refers to a specific version of a specific model tier from a specific provider.
max_tokens sets the output budget — the maximum number of tokens the model can generate in its response. Set it too low and the response gets truncated. Set it too high and you pay for capacity you don't use.
system is the system prompt discussed in Lecture 2.3. It establishes identity, tools, and behavioral guidelines. Some APIs (like Anthropic's) accept the system prompt as a separate parameter; others include it as the first message in the messages array. The effect is the same.
messages is the conversation history — an array of messages with roles and content. This array is the context window in practice. Everything the model knows about the current conversation comes through this array.
The complete script is in hello_api.py.
The API returns a structured response with three important fields:
response.content      # The model's text response (or tool use request)
response.stop_reason  # Why it stopped: "end_turn", "max_tokens", "tool_use"
response.usage        # Token counts: input_tokens, output_tokens
content contains the model's output. In most cases, this is text. When tools are available, it may instead contain a structured tool call request.
stop_reason tells your code how to handle the response. end_turn means the model finished naturally — this is the final response to pass to the user. tool_use means the model is requesting a tool call — this is the signal to execute the tool and continue the agent loop. max_tokens means the response was cut off before the model finished — you may need to increase the budget or handle the truncation.
usage reports token consumption: how many input tokens were sent and how many output tokens were generated. This is essential for cost tracking and for detecting when you're approaching context window limits.
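The branching on stop_reason is simple enough to capture in a small dispatcher. A minimal sketch; handle_stop_reason and the action names it returns are hypothetical, not part of the Anthropic SDK:

```python
def handle_stop_reason(stop_reason: str) -> str:
    """Map a stop_reason to the calling code's next action.

    The action names ("finish", "run_tool", "truncated") are
    illustrative labels, not SDK values.
    """
    if stop_reason == "end_turn":
        return "finish"      # pass the text to the user
    if stop_reason == "tool_use":
        return "run_tool"    # execute the requested tool, continue the loop
    if stop_reason == "max_tokens":
        return "truncated"   # raise the budget or handle the cut-off
    raise ValueError(f"Unexpected stop_reason: {stop_reason}")
```

Treating an unknown stop_reason as an error, rather than silently ignoring it, surfaces provider changes early.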
Every message in the conversation has a role that shapes how the model interprets its content:
system messages are instructions that frame everything. The model treats them as foundational operating guidelines. Typically set once and not changed during a conversation.
user messages come from the human — or from your agent's orchestration code. The model interprets these as requests, questions, or instructions to act on. Importantly, "user" messages are not always from a human: when your agent feeds tool results back to the model, those are sent as user messages too.
assistant messages are the model's own previous responses. Including these in the messages array shows the model what it said before, maintaining conversational continuity.
Every API call to an LLM is stateless. The model has no memory of previous calls. When you send a request, the model sees only what is in the messages array — nothing else. This means that maintaining a conversation requires your code to manage the messages array, appending each new user message and each assistant response.
Consider a simple multi-turn exchange:
messages = [
    {"role": "user", "content": "What's the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
    {"role": "user", "content": "What's its population?"}
]
The model can resolve "its" to "Paris" because the prior exchange is in context. Remove the first two messages and the model has no idea what "its" refers to.
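The bookkeeping is mechanical enough to wrap in a helper. A minimal sketch with the API call injected as a function, so the append-call-append pattern is visible; send_turn and call_model are hypothetical names, not SDK functions:

```python
def send_turn(messages, user_text, call_model):
    """Append a user message, call the model, append its reply.

    call_model is any function that takes the full messages list and
    returns the assistant's text -- e.g. a thin wrapper around
    client.messages.create. (Hypothetical helper, not part of the SDK.)
    """
    messages.append({"role": "user", "content": user_text})
    reply = call_model(messages)
    messages.append({"role": "assistant", "content": reply})
    return reply
```

Because the helper mutates the same list across calls, each turn automatically sees the full prior history.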
For a simple chat, this is straightforward. For agents, it gets interesting fast. Agent conversations grow with every tool call and every tool result:
messages = [
    {"role": "user", "content": "Fix the bug in main.py"},
    {"role": "assistant", "content": "[tool_use: read_file('main.py')]"},
    {"role": "user", "content": "[tool_result: contents of main.py...]"},
    {"role": "assistant", "content": "I see the issue. [tool_use: edit_file(...)]"},
    {"role": "user", "content": "[tool_result: file edited successfully]"},
    {"role": "assistant", "content": "I've fixed the off-by-one error..."}
]
A single user request might generate 10, 20, or 50 messages before the agent finishes. Every message adds to the context. The entire history is re-sent with every API call, so input token counts grow with every turn. This is why context management — covered in depth later in the course — is a central concern of agent development.
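Because the full history is re-sent on every call, cumulative input tokens grow roughly quadratically with turn count. A back-of-envelope sketch under a simplifying assumption (each turn adds a fixed number of tokens; the numbers are made up):

```python
def cumulative_input_tokens(tokens_per_turn, num_turns):
    """Total input tokens billed across a conversation where each turn
    adds tokens_per_turn to the history and the whole history is
    re-sent on every call."""
    return sum(tokens_per_turn * turn for turn in range(1, num_turns + 1))

# 50 turns at 500 tokens/turn re-sends the growing history every call,
# far exceeding the naive estimate of 50 * 500 = 25,000 tokens.
```

This quadratic growth is the quantitative reason context management matters: trimming or summarizing history reduces not just one call, but every call that follows.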
The conversation.py script demonstrates this directly: a three-turn conversation where you can watch the input token count climb from 32 to 50 to 93 as the context accumulates.
By default, you send a request and wait for the complete response. But LLMs generate tokens one at a time, and you can receive them as they are generated — this is streaming.
Streaming matters for three reasons:

Perceived latency. Users see the first words within moments instead of staring at a blank screen while the full response generates.

Long generations. Receiving tokens incrementally avoids client timeouts on lengthy responses and lets you show progress as it arrives.

Early inspection. Your code can examine output mid-generation and abort a response that is going off track, saving tokens and time.

For agent loops where a human is not watching the output, streaming is optional — waiting for the complete response and parsing it is often simpler. For user-facing applications, streaming is almost always the right choice.
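The consumption side of streaming can be factored into a small, testable helper. A sketch; accumulate_stream is a hypothetical name, and the commented SDK usage below it is an assumption about how you would drive it:

```python
def accumulate_stream(text_chunks, on_text=print):
    """Consume streamed text chunks as they arrive, invoking on_text
    for each one, and return the fully assembled response."""
    parts = []
    for chunk in text_chunks:
        on_text(chunk)       # e.g. render incrementally in a UI
        parts.append(chunk)
    return "".join(parts)

# With the Anthropic SDK, this would be driven roughly like:
#   with client.messages.stream(model=..., max_tokens=1024,
#                               messages=messages) as stream:
#       full_text = accumulate_stream(stream.text_stream)
```

Separating accumulation from the network call keeps the incremental-rendering logic easy to unit test.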
API calls fail. They fail more often than most developers expect, especially under load. The common failure modes:
Rate limits (HTTP 429). You have sent too many requests in a given time window. Providers throttle usage to manage server load. The solution is to wait and retry.
Overloaded (HTTP 529). The service is temporarily at capacity. Same treatment as rate limits — back off and retry.
Timeouts. The model is taking too long, typically due to complex prompts or very large contexts. Set reasonable timeouts and retry if appropriate.
Malformed responses. The model's output is not in the format you expected — text when you expected a tool call, or invalid JSON. Validate responses before acting on them.
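Validation for the malformed-response case can be a small guard that never lets unparsed output reach downstream code. A sketch; parse_json_reply is a hypothetical helper, not an SDK function:

```python
import json

def parse_json_reply(text):
    """Return the parsed object if text is valid JSON, else None.

    Callers decide what a None means: retry, re-prompt the model
    with the error, or fail the request."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None
```

Returning None rather than raising keeps the retry decision with the caller, which knows whether another model call is worth the cost.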
The standard pattern for handling transient errors is exponential backoff: wait longer after each successive failure.
import time

import anthropic

client = anthropic.Anthropic()

def call_with_retry(messages, system="", max_retries=5):
    for attempt in range(max_retries):
        try:
            response = client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=1024, system=system, messages=messages
            )
            return response
        except anthropic.RateLimitError:
            wait_time = 2 ** attempt  # 1s, 2s, 4s, 8s, 16s
            time.sleep(wait_time)
        except anthropic.APIStatusError as e:
            if e.status_code == 529:  # overloaded: back off and retry
                wait_time = 2 ** attempt
                time.sleep(wait_time)
            else:
                raise  # permanent error: fail fast
    raise Exception(f"Max retries ({max_retries}) exceeded")
This pattern — retry on transient errors, fail fast on permanent errors — belongs in every production agent. The complete implementation is in retry_pattern.py.
Three things to carry forward from this lecture:
The API call has a clear anatomy. Model, system prompt, and messages go in. Content, stop reason, and usage come out. This structure is consistent across providers.
The messages array is the context window. You manage it yourself — appending user messages, assistant responses, and tool results. Every API call is stateless; your code is responsible for maintaining conversational continuity.
Build error handling from day one. Retries with exponential backoff, response validation, and graceful failure handling are not optional for production agents.