Implementing Tools and Safety

In Lecture 6.1 we built the agent loop with stub tools. The control flow works — the model issues tool calls, receives results, and decides what to do next — but the stubs do nothing. This lecture replaces them with real filesystem operations. Along the way we add two things the earlier code in 5.3 did not have: directory safety validation that prevents the agent from escaping its sandbox, and the confirmation pattern for destructive operations.

The framing for the whole lecture is that tools are the only path through which an agent can affect the world. The LLM cannot read a file, write to disk, or call an API on its own — it can only emit a request that your tool code interprets and executes. That is both the source of the danger and the source of the safety. Whatever the model is permitted to do, it is permitted because of code you wrote. The stories of agents "running amok" almost always trace back to a developer who removed the guardrails — Claude Code's --dangerously-skip-permissions flag is the canonical example. The agent isn't malicious; the tools simply weren't constrained.

The Tool Contract

Every tool follows the same contract: accept arguments, return a string. The returned string becomes a tool_result content block in the next API call. This applies regardless of what the tool does, and regardless of whether it succeeds or fails.

If the tool fails, the error message goes into context the same way a success message would. The model reads it, understands the failure, and can adapt — retry with different arguments, take a different approach, or report the error to the user.
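As a minimal sketch of this contract, the snippet below wraps a tool's return string in a tool_result content block. It assumes the Anthropic Messages API block shape (type, tool_use_id, content); the helper name to_tool_result and the example IDs are hypothetical.

```python
def to_tool_result(tool_use_id, result_text):
    """Wrap a tool's string result in a tool_result content block.

    Success and failure strings are wrapped identically: the model
    simply reads whatever text the tool returned.
    """
    return {
        "type": "tool_result",
        "tool_use_id": tool_use_id,  # must echo the id from the model's tool_use block
        "content": result_text,
    }


# Hypothetical ids; real ones come from the model's tool_use blocks.
ok_block = to_tool_result("toolu_01", "main.py  [file]")
error_block = to_tool_result("toolu_02", "Error: file not found: notes.txt")
```

Both blocks go into the next user message in exactly the same way, which is why the wording of the error string matters so much.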

This leads to the most important principle in the lecture: error handling is communication with the model. A clear error string is actionable. A vague error string wastes a tool call. A crash gives the model nothing.

The right mental model is to imagine the LLM as a programmer who has to read your error messages to figure out what went wrong. Just as you sigh when an API returns "Error: invalid input" with no detail, the model is poorly served by the same kind of message. The difference is that the model does not get tired — you can put quite a lot of explanation in an error string and it will read all of it. (Within reason: every character is context budget.) Specific, descriptive errors with examples of correct usage are how you turn a failed tool call into a successful retry.

A related principle is that tools should treat their arguments as if they came from a potentially hostile caller. In ordinary Python code we might assume a path argument is a sensible string; in agent code we cannot. The model will sometimes call tools incorrectly while exploring their behavior, and a malicious user may try to coax the model into misusing them. Tools must validate, catch every plausible exception, and never crash.
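One way to combine defensive validation with descriptive errors is a small argument checker that runs before the tool body. This is a sketch only: check_args and its required-types mapping are hypothetical helpers, not part of the lecture's code.

```python
def check_args(tool_name, inputs, required):
    """Return None if inputs look valid, otherwise a descriptive error string.

    `required` maps each argument name to the type it must have.
    """
    # Build a usage hint so the error shows what a correct call looks like.
    usage = ", ".join(f"{name}=<{tp.__name__}>" for name, tp in required.items())
    for name, tp in required.items():
        if name not in inputs:
            return (f"Error: {tool_name} is missing required argument '{name}'. "
                    f"Expected call: {tool_name}({usage})")
        if not isinstance(inputs[name], tp):
            return (f"Error: {tool_name} argument '{name}' must be {tp.__name__}, "
                    f"got {type(inputs[name]).__name__}. "
                    f"Expected call: {tool_name}({usage})")
    return None


valid = check_args("read_file", {"filename": "a.txt"}, {"filename": str})
missing = check_args("read_file", {}, {"filename": str})
wrong_type = check_args("read_file", {"filename": 42}, {"filename": str})
```

The error names the tool, the offending argument, and the expected call shape, so a retry can succeed without another round of guessing.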

The Three Tools

The lecture builds three tools that, together, are sufficient for a simple coding agent: list_files, read_file, and edit_file. The complete implementations appear in the lecture slides; this section discusses the design decisions behind each.

list_files

def list_files(path):
    """List files and directories at the given path."""
    path = validate_path(path)
    try:
        entries = os.listdir(path)
        lines = []
        for entry in sorted(entries):
            full = os.path.join(path, entry)
            kind = "dir" if os.path.isdir(full) else "file"
            lines.append(f"{entry}  [{kind}]")
        return "\n".join(lines) if lines else "(empty directory)"
    except FileNotFoundError:
        return f"Error: directory not found: {path}"
    except PermissionError:
        return f"Error: permission denied: {path}"

Three design choices matter:

  1. Sorted output. Determinism is a feature in tool design. The model sees the same ordering every time it lists the same directory. Predictable tools are easier for the model to use well.
  2. Type tags. Each entry is tagged [file] or [dir] so the model can decide what to explore further without an extra round-trip.
  3. Errors as return values, not raised exceptions. The two exceptions os.listdir is most likely to raise are caught and converted into descriptive strings; anything rarer (for example NotADirectoryError when the path names a file) falls through to the dispatcher's catch-all. The model needs to read the error, not have it crash the agent loop.
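To see what the model actually receives, the sketch below runs a simplified listing function against a throwaway directory. format_listing is a hypothetical stand-in that omits validate_path and error handling so the example is self-contained.

```python
import os
import tempfile


def format_listing(path):
    """Simplified list_files: sorted entries, each tagged [file] or [dir]."""
    lines = []
    for entry in sorted(os.listdir(path)):
        kind = "dir" if os.path.isdir(os.path.join(path, entry)) else "file"
        lines.append(f"{entry}  [{kind}]")
    return "\n".join(lines) if lines else "(empty directory)"


with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "src"))
    # Create the files out of order; the listing is sorted regardless.
    for name in ("b.txt", "a.txt"):
        open(os.path.join(root, name), "w").close()
    listing = format_listing(root)
    empty = format_listing(os.path.join(root, "src"))

print(listing)
# a.txt  [file]
# b.txt  [file]
# src  [dir]
```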

read_file

def read_file(filename):
    """Read the complete contents of a file."""
    filename = validate_path(filename)
    try:
        with open(filename, "r", encoding="utf-8") as f:
            return f.read()
    except FileNotFoundError:
        return f"Error: file not found: {filename}"
    except PermissionError:
        return f"Error: permission denied: {filename}"
    except UnicodeDecodeError:
        return f"Error: file is not readable as text: {filename}"

This implementation is deliberately naive — the entire file enters the context. Lab 4 replaces it with search_file plus read_lines, which return only what the model asks for. The point of the simpler version is to make a working agent first; the optimization comes after.

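The UnicodeDecodeError catch is the one non-obvious detail. Without it, attempting to read a binary file (an image, a compiled .pyc, a .zip) raises an exception that propagates to the dispatcher shown later in this lecture. Because UnicodeDecodeError is a subclass of ValueError, the dispatcher would hand the model only the raw codec message (something like "'utf-8' codec can't decode byte ..."), which says little about what to do next. With the catch, the model gets a specific error and can tell the user "that file is binary, I can't read it."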
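For contrast with the whole-file read, here is a sketch of what a line-range reader could look like. It is an assumption about the general shape of such a tool, not the Lab 4 implementation: the signature, the numbering format, and the error wording are all hypothetical, and a real version would call validate_path first.

```python
import os
import tempfile


def read_lines(filename, start, end):
    """Return lines start..end (1-indexed, inclusive), each prefixed with its number."""
    if start < 1 or end < start:
        return (f"Error: invalid range {start}-{end}. "
                f"Use 1-indexed lines with start <= end, e.g. start=1, end=40.")
    try:
        with open(filename, "r", encoding="utf-8") as f:
            lines = f.read().splitlines()
    except FileNotFoundError:
        return f"Error: file not found: {filename}"
    except UnicodeDecodeError:
        return f"Error: file is not readable as text: {filename}"
    if start > len(lines):
        return f"Error: file has only {len(lines)} lines; start={start} is past the end."
    # Slicing tolerates an end beyond the last line, so no extra check is needed.
    selected = lines[start - 1:end]
    return "\n".join(f"{n}: {text}" for n, text in enumerate(selected, start=start))


with tempfile.TemporaryDirectory() as root:
    sample = os.path.join(root, "sample.txt")
    with open(sample, "w", encoding="utf-8") as f:
        f.write("alpha\nbeta\ngamma\ndelta\n")
    middle = read_lines(sample, 2, 3)
    tail = read_lines(sample, 3, 99)
    past_end = read_lines(sample, 9, 10)
```

Only the requested lines enter the context, and the line-number prefixes give the model stable coordinates for follow-up calls.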

edit_file

def edit_file(path, old_str, new_str):
    """Create or edit a file using string replacement."""
    path = validate_path(path)
    if old_str == "":
        dir_name = os.path.dirname(path)
        if dir_name:
            os.makedirs(dir_name, exist_ok=True)
        with open(path, "w", encoding="utf-8") as f:
            f.write(new_str)
        return f"Created {path}"
    else:
        try:
            with open(path, "r", encoding="utf-8") as f:
                contents = f.read()
        except FileNotFoundError:
            return f"Error: file not found: {path}. Read the file first."
        if old_str not in contents:
            return f"Error: text not found in {path}"
        updated = contents.replace(old_str, new_str, 1)
        with open(path, "w", encoding="utf-8") as f:
            f.write(updated)
        return f"Edited {path}"

This tool has two modes selected by whether old_str is empty: create mode writes a new file, edit mode does string replacement. Three details deserve attention:

  1. os.makedirs(dir_name, exist_ok=True) in create mode. Creating src/utils/helpers.py should work even if src/utils/ does not yet exist — without this, the model would have to make a separate "mkdir" call first, which it cannot because we did not give it one.
  2. contents.replace(old_str, new_str, 1) in edit mode. The 1 means only the first occurrence is replaced. If the model wants a different occurrence it must supply a longer, more specific old_str that matches only there. This prevents a single edit from accidentally rewriting code in multiple places. Claude Code's edit tool pursues the same goal more strictly: it rejects an old_str that matches more than once.
  3. The error message "Read the file first." When the file is missing, the error string itself reinforces the workflow rule. The model is much more likely to follow the correct sequence (read → edit) when the error tells it to than when only the system prompt does.
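The effect of the count argument, and a stricter variant that refuses ambiguous matches, can be sketched as follows. replace_once is a hypothetical helper shown for comparison; it is not the lecture's edit_file, and it assumes old_str is non-empty, as it always is in edit mode.

```python
def replace_once(contents, old_str, new_str):
    """Return (text, message); reject old_str that is missing or ambiguous."""
    count = contents.count(old_str)
    if count == 0:
        return contents, "Error: text not found"
    if count > 1:
        return contents, (f"Error: old_str matches {count} places. "
                          "Include more surrounding text so it matches exactly once.")
    return contents.replace(old_str, new_str, 1), "Edited"


source = "x = 1\ny = 1\n"

# Lecture behavior: count=1 silently edits the first match only.
first_only = source.replace("= 1", "= 2", 1)

# Stricter variant: the ambiguous request is refused with guidance.
unchanged, ambiguous_msg = replace_once(source, "= 1", "= 2")

# A more specific old_str matches exactly once and succeeds.
updated, ok_msg = replace_once(source, "y = 1", "y = 2")
```

The trade-off is an extra tool call on ambiguous input in exchange for never editing a location the model did not intend.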

Directory Safety: From Suggestion to Enforcement

The system prompt typically tells the model "Only access files within the current project directory." The natural question is: what happens if the user asks it to read /etc/passwd or ../../secrets.env?

Without tool-level enforcement, the answer is: it depends. The model might refuse, citing the system prompt. It might not. Prompt compliance is probabilistic — it degrades under pressure, under long contexts, under repeated user requests, and under adversarial framing. Models are trained to be helpful, and a sufficiently insistent user can wear them down. Relying on a system prompt to keep the agent inside its directory is not a security strategy; it is a hope.

Tool enforcement is the deterministic alternative. It puts the rule in code that the model cannot bypass.

import os

ALLOWED_DIR = os.path.realpath(os.getcwd())

def validate_path(path):
    """Resolve a path and verify it falls within the allowed directory.

    Returns the resolved absolute path.
    Raises ValueError if the path escapes the allowed directory.
    """
    resolved = os.path.realpath(os.path.join(ALLOWED_DIR, path))
    if not resolved.startswith(ALLOWED_DIR + os.sep) and resolved != ALLOWED_DIR:
        raise ValueError(f"Access denied: {path} is outside the project directory")
    return resolved

Three details make this work:

  1. os.path.realpath resolves symlinks and .. components. A path like ../../etc/passwd resolves to a location outside the project (for example /etc/passwd), which clearly does not start with the project directory. Encoding tricks do not help an attacker either: realpath performs no URL or Unicode decoding, so a string such as %2e%2e/ is treated as a literal, oddly named directory inside the project rather than as a parent-directory reference. Symlinks are handled by the same check: a symlink inside the project that points to /etc/ resolves to a path under /etc/, which fails the directory check.
  2. os.sep in the prefix check. Without the separator, a directory like /home/project-backup would pass the prefix check for /home/project because the strings start the same way. Appending the separator forces the comparison to land on a directory boundary.
  3. raise ValueError, not return an error string. The function is called inside every tool. If it raises, the dispatcher's outer try/except catches it and returns the error to the model. The default behavior is denial — if a tool ever forgot to wrap the call, the operation would fail rather than succeed silently.
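These properties can be checked with a throwaway directory tree. The sketch below assumes a factory, make_validator, that binds the same logic to an explicit directory instead of the current working directory; that factory and the is_allowed helper are hypothetical conveniences for testing.

```python
import os
import tempfile


def make_validator(allowed_dir):
    """Build a validate_path-style function bound to allowed_dir."""
    allowed = os.path.realpath(allowed_dir)

    def validate(path):
        resolved = os.path.realpath(os.path.join(allowed, path))
        if not resolved.startswith(allowed + os.sep) and resolved != allowed:
            raise ValueError(f"Access denied: {path} is outside the project directory")
        return resolved

    return validate


def is_allowed(validate, path):
    """True if validate accepts the path, False if it raises ValueError."""
    try:
        validate(path)
        return True
    except ValueError:
        return False


with tempfile.TemporaryDirectory() as base:
    project = os.path.join(base, "project")
    sibling = os.path.join(base, "project-backup")  # shares the string prefix
    os.mkdir(project)
    os.mkdir(sibling)
    validate = make_validator(project)

    inside = is_allowed(validate, os.path.join("src", "main.py"))
    root_itself = is_allowed(validate, ".")
    traversal = is_allowed(validate, os.path.join("..", "..", "etc", "passwd"))
    prefix_sibling = is_allowed(validate, sibling)  # absolute path to the sibling
    encoded = is_allowed(validate, "%2e%2e/secrets.env")  # not decoded: literal name
```

Here inside, root_itself, and encoded are accepted because they resolve within the project, while traversal and prefix_sibling are rejected. Note that an absolute path passed to os.path.join discards the project prefix entirely, which is why the check runs on the resolved result rather than on the raw argument.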

The dispatcher pulls this together:

def dispatch_tool(name, inputs):
    """Execute a tool and return its result as a string."""
    try:
        if name == "list_files":
            return list_files(**inputs)
        elif name == "read_file":
            return read_file(**inputs)
        elif name == "edit_file":
            return edit_file(**inputs)
        else:
            return f"Error: unknown tool: {name}"
    except ValueError as e:
        return str(e)
    except Exception as e:
        return f"Error: {type(e).__name__}: {e}"

The model receives "Access denied: ../../etc/passwd is outside the project directory" as a tool result and can report the restriction to the user. It cannot circumvent the rule, no matter what it is told.
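The dispatcher's behavior on bad calls can be exercised with a small self-contained sketch. The stub tools and the TOOLS registry are hypothetical stand-ins, and the table-driven lookup is one possible alternative to the if/elif chain rather than the lecture's code.

```python
def read_file(filename):
    """Stub standing in for the real tool."""
    return f"(contents of {filename})"


def locked_tool(path):
    """Stub that always fails the way validate_path would."""
    raise ValueError(f"Access denied: {path} is outside the project directory")


TOOLS = {"read_file": read_file, "locked_tool": locked_tool}


def dispatch_tool(name, inputs):
    """Look up and run a tool; every outcome comes back as a string."""
    tool = TOOLS.get(name)
    if tool is None:
        return f"Error: unknown tool: {name}"
    try:
        return tool(**inputs)
    except ValueError as e:
        # validate_path-style denials already carry a readable message.
        return str(e)
    except Exception as e:
        # Catch-all: wrong argument names, I/O failures, programming errors.
        return f"Error: {type(e).__name__}: {e}"


good = dispatch_tool("read_file", {"filename": "a.txt"})
denied = dispatch_tool("locked_tool", {"path": "../x"})
bad_args = dispatch_tool("read_file", {"path": "a.txt"})  # wrong keyword
unknown = dispatch_tool("rm", {})
```

The wrong-keyword call raises TypeError inside the try block, so the model gets an "Error: TypeError: ..." string naming the unexpected argument instead of a crashed loop.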

Defense in Depth

The system prompt and tool enforcement are not redundant — they serve different purposes:

Aspect      | Prompt constraint                   | Tool enforcement
Mechanism   | Behavioral guidance                 | Code execution
Reliability | Probabilistic                       | Deterministic
Bypass      | Model may ignore under pressure     | Cannot be bypassed by the model
Feedback    | Model may not mention the violation | Returns explicit error string
When to use | Shape default behavior              | Enforce security boundaries

The prompt constraint discourages the model from attempting prohibited operations in the first place, which saves wasted tool calls and tokens. The tool enforcement catches the cases where the model tries anyway. Together with the dispatcher's catch-all, they form three layers of defense:

  1. Outer layer: the system prompt — behavioral guidance.
  2. Middle layer: validate_path inside each tool — deterministic enforcement.
  3. Inner layer: the dispatcher's catch-all — fail-safe for any unhandled exception.

The principle from Lecture 5.3 still holds in its strongest form: the absence of a delete tool makes deletion impossible. But for tools that do exist, the tool itself must enforce its boundaries. Prompts are advisory; tool code is authoritative.

A useful security analogy: prompts are like written policies, and tools are like access controls. A policy can say "do not access these files." An access control can prevent it. Real systems have both — and so should agents.

The Confirmation Pattern

Some operations are destructive and cannot be undone: deleting files, overwriting databases, sending email, making payments. For these, the right behavior is not to act on the first request but to confirm first. Lab 4 implements delete_file with this pattern.

There are two ways to confirm. The first — and the one the lecture's example illustrates — is to make the tool itself respond differently based on a confirmed flag:

def delete_file(path, confirmed=False):
    """Delete a file. Requires confirmation."""
    path = validate_path(path)
    if not confirmed:
        return (f"About to delete {path}. "
                f"Call delete_file again with confirmed=True to proceed.")
    os.remove(path)
    return f"Deleted {path}"

The first call returns a warning instead of acting. The model receives the warning as a tool result, surfaces it to the user as text, and ends its turn. When the user confirms, a new turn begins; the model calls the tool again with confirmed=True, and only then is the file removed.

The elegance of this approach is that the existing agent loop carries the back-and-forth automatically. No special "confirmation mode" is needed — the model emits text after the warning, the inner loop exits on end_turn, and the outer loop waits for the next user message. Five steps:

  1. Model calls delete_file(path="important.py"); confirmed defaults to False.
  2. Tool returns the warning string — it appears in messages as a tool_result.
  3. Model reads the warning, surfaces it to the user as text, and ends its turn.
  4. User responds with confirmation — a new outer loop iteration begins.
  5. Model calls delete_file(path="important.py", confirmed=True) — file is deleted.
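The five steps can be simulated without a model by calling the tool twice against a temporary file. This sketch repeats delete_file without validate_path so that it is self-contained; the two direct calls stand in for the model's two tool calls.

```python
import os
import tempfile


def delete_file(path, confirmed=False):
    """Two-step delete: warn on the first call, act only when confirmed."""
    if not confirmed:
        return (f"About to delete {path}. "
                f"Call delete_file again with confirmed=True to proceed.")
    os.remove(path)
    return f"Deleted {path}"


with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, "important.py")
    open(target, "w").close()

    # Steps 1-2: first call returns the warning and leaves the file alone.
    first = delete_file(target)
    exists_after_first = os.path.exists(target)

    # Steps 4-5: after the user agrees, the second call removes the file.
    second = delete_file(target, confirmed=True)
    exists_after_second = os.path.exists(target)
```

Nothing in the tool itself stops a caller from passing confirmed=True on the very first call, so this pattern relies on the model relaying the warning faithfully.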

The second approach is to have the tool communicate directly with the user — through the agent's UI rather than through the LLM. The tool itself displays a confirmation dialog, blocks until the user responds, and either proceeds or returns a "user declined" message to the model. This is generally preferable when feasible: it removes the LLM as a middle layer in a security-relevant decision and gives the user a clearer interaction. But it is not always feasible — agents that run on a network, in batch, or without a live user have no UI to talk to. In those cases, the LLM round-trip pattern is the fallback.
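A minimal sketch of this second approach injects the user prompt as a callable, so the tool can block on a real UI in production and on a fake in tests. The name delete_file_interactive and the ask_user hook are hypothetical; with the default, the tool falls back to Python's built-in input on the terminal.

```python
import os
import tempfile


def delete_file_interactive(path, ask_user=input):
    """Delete a file only after the user approves through ask_user."""
    if not os.path.isfile(path):
        return f"Error: file not found: {path}"
    # The question goes straight to the user; the model never sees or relays it.
    answer = ask_user(f"Delete {path}? [y/N] ")
    if answer.strip().lower() not in ("y", "yes"):
        return f"User declined to delete {path}."
    os.remove(path)
    return f"Deleted {path}"


with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, "important.py")
    open(target, "w").close()

    # Fake UI callbacks stand in for a real dialog.
    declined = delete_file_interactive(target, ask_user=lambda prompt: "n")
    still_there = os.path.exists(target)
    approved = delete_file_interactive(target, ask_user=lambda prompt: "yes")
    gone = not os.path.exists(target)
```

Because the decision is made inside the tool, the model cannot skip it by choosing different arguments; it only learns the outcome from the returned string.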

In either implementation, the underlying point is that tools can take control. The same principles that govern any user interface — confirmation for destructive actions, clear messaging, sensible defaults — still apply when there is an LLM sitting between the user and your code. The LLM does not change the rules; it is just another caller, and one that is occasionally unreliable.

Key Takeaways