Series: Agentic AI in Python — Zero to Production
This is Part 6 — adding AI agent observability and evals. The story so far:
- Part 1: A local LangGraph agent with tools and a checkpointer
- Part 4: Long-term memory across threads
- Part 5: Real tools via an MCP client — the agent can now act
New here? You’ll need the Part 5 agent that can call tools — this post puts a microscope and a scorecard on it.
By the end of Part 5 your agent could think, remember, and act. There’s one thing it still can’t do: tell you whether any of that is actually good. Right now you find out it broke the way everyone finds out — a user complains. This part fixes that with two tools that work together. AI agent observability shows you exactly what every run did. Evals score the agent against a fixed test set, so a bad change can’t sneak past you.
I put these last on purpose. Adding capability feels like progress; measuring it feels like overhead. That holds right up until the day a prompt tweak quietly breaks tool-calling and you have no way to notice. By the end of this post you’ll trace every run in a real dashboard and run a pass/fail suite you can wire into CI. Three short steps. First, what these two words actually mean.
What We’re Building
AI agent observability and evals answer two different questions, and you need both. Observability (seeing the internal state of a running system from the outside) answers “what did my agent just do?” — every model call, tool call, token, and millisecond. Evals (evaluations — scoring outputs against expected behaviour) answer “is my agent any good?” They check a set of cases you care about, not just the one you happened to test by hand.
The diagram shows the split. The same agent feeds two pipelines. One emits OpenTelemetry spans (timed records of each step) to a dashboard, so you can watch live traffic. The other runs the agent over test cases and scores the results. You’re not changing the agent — you’re putting instruments around it. The observability half is the “see it” lane; the evals half is the “score it” lane.
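If “span” still feels abstract, a toy version helps. The sketch below is not how Logfire or OpenTelemetry is implemented; it is a standard-library stand-in that assumes a span is just a named, timed record with a few attributes. The names `SPANS`, `span`, and `fake_agent_run` are made up for illustration.

```python
import time
from contextlib import contextmanager

# Hypothetical in-memory store; a real setup exports spans to a backend instead.
SPANS: list[dict] = []


@contextmanager
def span(name: str, **attributes):
    """Record how long the wrapped block took, plus any attributes."""
    start = time.perf_counter()
    try:
        yield
    finally:
        SPANS.append(
            {
                "name": name,
                "duration_ms": (time.perf_counter() - start) * 1000,
                "attributes": attributes,
            }
        )


def fake_agent_run(question: str) -> str:
    """Stand-in for the real agent: one 'model' step and one 'tool' step."""
    with span("agent_run", question=question):
        with span("model_call"):
            time.sleep(0.01)  # pretend the model is thinking
        with span("tool_call", tool="disk_usage"):
            time.sleep(0.02)  # pretend the tool is running
        return "42 GB free"


answer = fake_agent_run("How much disk is left?")
for record in SPANS:
    print(f"{record['name']}: {record['duration_ms']:.1f} ms")
```

Inner spans finish first, so they are recorded before the outer `agent_run` span. A real tracing library also links each child to its parent, which is what lets a dashboard draw the run as a tree.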
What You Need
Everything here builds on the agent from Part 5 — you don’t write any new agent logic, just wrap what you have.
- The Part 5 agent that can call MCP tools (LangGraph or Pydantic AI)
- pip install -U logfire pydantic-evals (both current in June 2026)
- A free Pydantic Logfire account for a write token, or any OpenTelemetry backend
- An API key for the model your judge will use (the evals step needs one)
If you’d rather not sign up for anything, Logfire is built on OpenTelemetry, so you can point the same spans at a self-hosted backend like Langfuse instead. The instrument_pydantic_ai() call below doesn’t change; only the configure() step, which decides where the spans are sent, does.
Step 1 — Trace Every Run with Logfire
Here’s the early win: two lines turn an invisible agent into one you can watch. Pydantic Logfire is an observability platform built on OpenTelemetry, and Pydantic AI has native support for it. You configure it once, at startup:
    # observability.py: turn tracing on once
    import os

    import logfire

    logfire.configure(token=os.environ["LOGFIRE_TOKEN"])
    logfire.instrument_pydantic_ai()  # every agent run is now a trace
That’s it. logfire.configure() sets up the exporter, and instrument_pydantic_ai() makes every agent run, model call, and tool call show up as a span — no per-call logging, no print statements. Run your Part 5 agent once and the dashboard fills with a tree you can expand: the run, the model’s thinking, the disk_usage tool call, the result.
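On the LangGraph side there’s no separate API to learn. LangChain can emit OpenTelemetry traces through its LangSmith integration, so set LANGSMITH_OTEL_ENABLED=true in the environment before your agent starts (Logfire’s LangChain guide sets LANGSMITH_TRACING=true alongside it), call logfire.configure() as above, and Logfire (or any OTel backend) picks up the same spans. One dashboard, whichever framework you built on.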
The token lives in an environment variable, never in the file — the same rule we used for the MCP server’s auth in Part 5.
Step 2 — AI Agent Observability in Practice
A trace is only useful if you know what to look at. This is where AI agent observability earns its keep: three numbers tell you most of what you need on any given run.
- Latency — how long the run took, and which step ate it. A slow agent is almost always one slow tool call, and the span tree points straight at it.
- Cost — Logfire reads token counts off each model call and totals them. You see the price per run, instead of guessing at the end of the month.
- Tool calls: every tool the model chose, with its arguments and result, so you can confirm it called disk_usage and not something it invented.
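To make those three numbers concrete, here is a small sketch that derives them from one run’s spans. Everything in it is assumed for illustration: the flattened span shape, the token counts, and the per-million-token prices are made up, and Logfire does this aggregation for you in the dashboard.

```python
from collections import Counter

# Hypothetical flattened spans for a single run; a real backend stores more detail.
run_spans = [
    {"name": "model_call", "duration_ms": 620, "input_tokens": 850, "output_tokens": 40},
    {"name": "tool_call", "duration_ms": 310, "tool": "disk_usage"},
    {"name": "model_call", "duration_ms": 540, "input_tokens": 930, "output_tokens": 120},
]

# Assumed prices in USD per million tokens; substitute your model's real rates.
PRICE_PER_M_INPUT = 3.00
PRICE_PER_M_OUTPUT = 15.00


def summarize(spans: list[dict]) -> dict:
    """Reduce one run's spans to latency, cost, and tool-call counts."""
    slowest = max(spans, key=lambda s: s["duration_ms"])
    input_tokens = sum(s.get("input_tokens", 0) for s in spans)
    output_tokens = sum(s.get("output_tokens", 0) for s in spans)
    cost = (
        input_tokens * PRICE_PER_M_INPUT + output_tokens * PRICE_PER_M_OUTPUT
    ) / 1_000_000
    return {
        "total_ms": sum(s["duration_ms"] for s in spans),
        "slowest_step": slowest["name"],
        "cost_usd": cost,
        "tool_calls": Counter(s["tool"] for s in spans if "tool" in s),
    }


summary = summarize(run_spans)
print(summary)
```

With these made-up figures the run takes about 1.5 seconds and costs under a cent. The useful part is `slowest_step`: it names the span to open first when a run feels slow.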
In my own runs, a healthy tool-backed answer round-trips in roughly 1–2 seconds and costs a fraction of a cent. When a run suddenly takes eight seconds, the trace shows the cause: a retry against an unreachable server, not a slow model. The point of tracing is to turn “it feels slow” into a specific span you can fix. That alone is worth the two lines of setup — but it still doesn’t tell you whether the answers are right. For that, you need evals.
Step 3 — Score the Agent with pydantic-evals
Observability shows you one run at a time. Evals run a whole batch and grade them, so you can change a prompt and instantly see if anything regressed (got worse than before). Pydantic Evals models this as a Dataset of Case objects scored by evaluators:
    # evals.py: score the agent against a fixed test set
    from pydantic_evals import Case, Dataset
    from pydantic_evals.evaluators import LLMJudge

    dataset = Dataset(
        cases=[
            Case(name="disk_check", inputs="How much disk is left on the ops box?"),
            Case(name="no_such_tool", inputs="What's the CPU temperature?"),  # there is no tool for this
        ],
        evaluators=[
            LLMJudge(
                rubric=(
                    "Answer uses a real tool result and never invents a number; "
                    "if no tool can answer, it says so plainly."
                ),
            ),
        ],
    )


    async def run_agent(question: str) -> str:
        result = await agent.run(question)  # the Part 5 agent
        return result.output


    report = dataset.evaluate_sync(run_agent)
    report.print(include_input=True, include_output=True)
Each Case is one input you care about. The LLMJudge evaluator uses a model to grade each answer against your plain-English rubric. That’s ideal for qualities you can’t check with ==, like “didn’t hallucinate a number.” Then evaluate_sync() runs the agent over every case, and report.print() prints a table with a green tick per passed case.
The second case is the one that matters. It asks for data no tool provides, and a good agent says “I can’t find that” instead of guessing. That’s exactly the failure your users would hit, encoded as a test you run on every change.
The suite is meant to grow. Every real bug your agent ships becomes a new Case so it can never come back, and you can attach a different rubric to each case when one answer needs stricter grading than another. A handful of cases on day one is fine — what matters is that the set only ever gets bigger.
Testing It + Common Errors
Run python evals.py and read the table: both cases should pass, with the CPU case passing because the agent refused to invent a temperature. Now prove the suite works by breaking the agent on purpose — loosen the system prompt to allow guessing, rerun, and watch the no_such_tool case flip to a fail. A suite that never fails isn’t protecting you.
Three errors cost me the most time setting this up:
Common mistake: calling instrument_pydantic_ai() after the agent already ran. Instrumentation only catches runs that happen after configuration, so put both lines at import time, before any agent call.
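The other two are quieter. With the os.environ["LOGFIRE_TOKEN"] lookup from Step 1, an unset variable fails loudly with a KeyError at startup. If you instead let configure() find the token itself, a missing LOGFIRE_TOKEN can leave it in local-only mode (send_to_logfire='if-token-present' does this by design): traces work in your terminal but never reach the dashboard, so check the token first if the web view stays empty. And an LLMJudge with no API key fails every case with an auth error that looks like a scoring failure. The judge is itself a model call, so it needs its own key.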
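Both quiet failures come down to a missing environment variable, so one cheap guard is a preflight check that runs before tracing or evals start. This is a sketch of my own, not part of Logfire or Pydantic Evals. `LOGFIRE_TOKEN` comes from Step 1; `OPENAI_API_KEY` is an assumption that only applies if your judge is an OpenAI model, so swap in whatever variable your provider reads.

```python
import os

# LOGFIRE_TOKEN is from Step 1. OPENAI_API_KEY is an assumed name for the
# judge's key; replace it with the variable your judge's provider expects.
REQUIRED = ["LOGFIRE_TOKEN", "OPENAI_API_KEY"]


def missing_settings(required, environ=None):
    """Return the names that are unset or blank in the environment."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name, "").strip()]


def preflight(environ=None):
    """Fail fast with one readable message instead of a confusing eval table."""
    missing = missing_settings(REQUIRED, environ)
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))


if __name__ == "__main__":
    try:
        preflight()
        print("config looks complete")
    except RuntimeError as err:
        print(err)
```

Calling `preflight()` as the first line of evals.py turns a table full of apparent scoring failures into a single message that names the missing key.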
What to Build Next — and the Series Wrap
Your agent is now complete in the way that counts: it can act (Part 5), remember (Part 4), run as a service (Part 2), and — finally — be measured. The natural next move is to run the eval suite in CI so a pull request that drops the score can’t merge. Because evaluate_sync() returns a report object, you can assert on its pass rate in a normal test and fail the build when it slips.
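Here is roughly what that gate could look like. It is a sketch under one big assumption: that you have already reduced the report to one pass/fail boolean per case. How you pull those out of the report object depends on the pydantic-evals version you have installed, so I have left that step out. `CaseResult`, `pass_rate`, and `check_gate` are hypothetical helpers, not library APIs.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    """Hypothetical per-case outcome extracted from the eval report."""

    name: str
    passed: bool


def pass_rate(results: list[CaseResult]) -> float:
    """Fraction of cases that passed; an empty suite counts as 0.0, not 1.0."""
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)


def check_gate(results: list[CaseResult], minimum: float = 1.0) -> float:
    """Raise AssertionError (which fails a pytest run) if the rate is too low."""
    rate = pass_rate(results)
    if rate < minimum:
        failed = [r.name for r in results if not r.passed]
        raise AssertionError(
            f"eval pass rate {rate:.0%} is below {minimum:.0%}; "
            f"failed: {', '.join(failed) or 'none'}"
        )
    return rate


results = [CaseResult("disk_check", True), CaseResult("no_such_tool", False)]
try:
    check_gate(results)
except AssertionError as err:
    print(err)
```

Treating an empty suite as a failure is deliberate: a gate that passes when no cases ran protects nothing. Whether `minimum` should be 1.0 is a judgment call, since LLM-judged cases can be flaky and you may prefer a slightly lower threshold.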
I’d build that gate before adding a single new tool. The whole series has pointed here: capability you can’t measure is capability that quietly rots, and a ten-line eval suite is the cheapest insurance you’ll ever write for an agent.
So, to close the series — drop your answer in the comments: now that you can see and score your agent, what’s the first regression you’d want the suite to catch? Tell me, and I’ll turn the best answers into a CI walkthrough.








