AI Agent Observability & Evals in Python (Part 6)

Artificial Intelligence
June 18, 2026
6 min read
Intermediate
📚 Part of the series: Agentic AI in Python: Zero to Production
Figure: AI agent observability and evals dashboard, showing an agent run traced and scored against a test set (Part 6 of the Agentic AI in Python series).
Table of Contents
01 What We're Building
02 What You Need
03 Step 1 — Trace Every Run with Logfire
04 Step 2 — AI Agent Observability in Practice
05 Step 3 — Score the Agent with pydantic-evals
06 Testing It + Common Errors
07 What to Build Next — and the Series Wrap

Series: Agentic AI in Python — Zero to Production

This is Part 6 — adding AI agent observability and evals. The story so far:

  • Part 1: A local LangGraph agent with tools and a checkpointer
  • Part 4: Long-term memory across threads
  • Part 5: Real tools via an MCP client — the agent can now act

New here? You’ll need the Part 5 agent that can call tools — this post puts a microscope and a scorecard on it.

By the end of Part 5 your agent could think, remember, and act. There’s one thing it still can’t do: tell you whether any of that is actually good. Right now you find out it broke the way everyone finds out — a user complains. This part fixes that with two tools that work together. AI agent observability shows you exactly what every run did. Evals score the agent against a fixed test set, so a bad change can’t sneak past you.

I put these last on purpose. Adding capability feels like progress; measuring it feels like overhead. That holds right up until the day a prompt tweak quietly breaks tool-calling and you have no way to notice. By the end of this post you’ll trace every run in a real dashboard and run a pass/fail suite you can wire into CI. Three short steps. First, what these two words actually mean.

What We’re Building

AI agent observability and evals answer two different questions, and you need both. Observability (seeing the internal state of a running system from the outside) answers “what did my agent just do?” — every model call, tool call, token, and millisecond. Evals (evaluations — scoring outputs against expected behaviour) answer “is my agent any good?” They check a set of cases you care about, not just the one you happened to test by hand.

Figure: AI agent observability and evals architecture. The agent built in Parts 1 to 5 emits OpenTelemetry spans to Logfire for traces, cost and tool-call inspection, and is run against pydantic-evals test cases to produce a pass or fail report.

The diagram shows the split. The same agent feeds two pipelines. One emits OpenTelemetry spans (timed records of each step) to a dashboard, so you can watch live traffic. The other runs the agent over test cases and scores the results. You’re not changing the agent — you’re putting instruments around it. The observability half is the “see it” lane; the evals half is the “score it” lane.
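
If "span" still feels abstract, a toy version helps. The sketch below is not Logfire or the OpenTelemetry SDK; `ToyTracer` and `Span` are hypothetical names I made up to show the shape of the data: a named, timed record that can hold child records.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Span:
    """One timed step: a name, a duration, and any nested child steps."""
    name: str
    duration_ms: float = 0.0
    children: list["Span"] = field(default_factory=list)


class ToyTracer:
    """Hypothetical stand-in for a tracer; it only keeps a span tree in memory."""

    def __init__(self) -> None:
        self.roots: list[Span] = []
        self._stack: list[Span] = []

    @contextmanager
    def span(self, name: str):
        current = Span(name)
        # Attach to the currently open span, or start a new tree at the root.
        parent_list = self._stack[-1].children if self._stack else self.roots
        parent_list.append(current)
        self._stack.append(current)
        start = time.perf_counter()
        try:
            yield current
        finally:
            current.duration_ms = (time.perf_counter() - start) * 1000
            self._stack.pop()


tracer = ToyTracer()
with tracer.span("agent run"):
    with tracer.span("model call"):
        pass
    with tracer.span("tool call: disk_usage"):
        pass
```

A real tracer also exports each span to a backend and tags it with attributes, but the tree is the same idea you will be expanding in the dashboard.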

What You Need

Everything here builds on the agent from Part 5 — you don’t write any new agent logic, just wrap what you have.

  • The Part 5 agent that can call MCP tools (LangGraph or Pydantic AI)
  • pip install -U logfire pydantic-evals (both current in June 2026)
  • A free Pydantic Logfire account for a write token, or any OpenTelemetry backend
  • An API key for the model your judge will use (the evals step needs one)

Figure: Six-step flow for adding observability and evals to a Python agent: start from the Part 5 agent, install Logfire and pydantic-evals, configure and instrument, run and read traces, write evaluation cases, and score the agent, so you catch regressions before users do.

If you’d rather not sign up for anything, Logfire is built on OpenTelemetry, so you can point the same spans at a self-hosted backend like Langfuse instead. The instrumentation call below doesn’t change; only the exporter configuration does.

Step 1 — Trace Every Run with Logfire

Here’s the early win: two lines turn an invisible agent into one you can watch. Pydantic Logfire is an observability platform built on OpenTelemetry, and Pydantic AI has native support for it. You configure it once, at startup:

python
# observability.py: turn tracing on once, at startup
import os

import logfire

# Read the write token from the environment; a missing variable raises KeyError here.
logfire.configure(token=os.environ["LOGFIRE_TOKEN"])
logfire.instrument_pydantic_ai()  # every agent run is now a trace

That’s it. logfire.configure() sets up the exporter, and instrument_pydantic_ai() makes every agent run, model call, and tool call show up as a span — no per-call logging, no print statements. Run your Part 5 agent once and the dashboard fills with a tree you can expand: the run, the model’s thinking, the disk_usage tool call, the result.

On the LangGraph side there’s no separate API to learn. LangChain can emit OpenTelemetry traces through the LangSmith SDK, so set LANGSMITH_OTEL_ENABLED=true (alongside LANGSMITH_TRACING=true, which switches tracing on in the first place) and Logfire (or any OTel backend) picks up the same spans. One dashboard, whichever framework you built on.

The token lives in an environment variable, never in the file — the same rule we used for the MCP server’s auth in Part 5.
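
One way to make a missing variable fail loudly, with a message that says what to do, is a small startup helper. This is a sketch: `require_env` is a hypothetical name of mine, not part of Logfire.

```python
import os
from collections.abc import Mapping


def require_env(name: str, env: Mapping[str, str] = os.environ) -> str:
    """Return a required setting, or stop at startup with a readable message."""
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before starting the agent")
    return value


# Typical use at startup (left commented so the sketch runs without a token):
# token = require_env("LOGFIRE_TOKEN")
```

Passing `env` explicitly is only there to make the helper easy to test; in the app you would rely on the default.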

Step 2 — AI Agent Observability in Practice

A trace is only useful if you know what to look at. This is where AI agent observability earns its keep: three signals tell you most of what you need on any given run.

  • Latency — how long the run took, and which step ate it. A slow agent is almost always one slow tool call, and the span tree points straight at it.
  • Cost — Logfire reads token counts off each model call and totals them. You see the price per run, instead of guessing at the end of the month.
  • Tool calls — every tool the model chose, with its arguments and result, so you can confirm it called disk_usage and not something it invented.
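
To make those three checks concrete, here is a rough sketch that computes them by hand from a flattened list of spans. Everything in it is an assumption for illustration: the `SpanRecord` shape, the prices, and the `KNOWN_TOOLS` set are mine, and a real export carries far more fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpanRecord:
    """Hypothetical flattened span; not the schema of any real backend."""
    name: str
    kind: str            # "model" or "tool"
    duration_ms: float
    input_tokens: int = 0
    output_tokens: int = 0


# Assumed prices in USD per million tokens; substitute your model's real rates.
PRICE_PER_M_INPUT = 1.00
PRICE_PER_M_OUTPUT = 4.00
KNOWN_TOOLS = {"disk_usage"}


def summarize(spans: list[SpanRecord]) -> dict:
    """Return the slowest step, the run's cost, and any unexpected tool names."""
    slowest = max(spans, key=lambda s: s.duration_ms)
    cost = sum(
        s.input_tokens * PRICE_PER_M_INPUT + s.output_tokens * PRICE_PER_M_OUTPUT
        for s in spans
    ) / 1_000_000
    tools = [s.name for s in spans if s.kind == "tool"]
    return {
        "slowest_step": slowest.name,
        "cost_usd": cost,
        "tool_calls": tools,
        "unknown_tools": [t for t in tools if t not in KNOWN_TOOLS],
    }


run = [
    SpanRecord("model: plan", "model", 420.0, input_tokens=900, output_tokens=60),
    SpanRecord("disk_usage", "tool", 6100.0),
    SpanRecord("model: answer", "model", 510.0, input_tokens=1100, output_tokens=140),
]
summary = summarize(run)
```

In this made-up run the slow step is the tool call, not either model call, which is the pattern the latency bullet describes.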

In my own runs, a healthy tool-backed answer round-trips in roughly 1–2 seconds and costs a fraction of a cent. When a run suddenly takes eight seconds, the trace shows the cause: a retry against an unreachable server, not a slow model. The point of tracing is to turn “it feels slow” into a specific span you can fix. That alone is worth the two lines of setup — but it still doesn’t tell you whether the answers are right. For that, you need evals.

Step 3 — Score the Agent with pydantic-evals

Observability shows you one run at a time. Evals run a whole batch and grade them, so you can change a prompt and instantly see if anything regressed (got worse than before). Pydantic Evals models this as a Dataset of Case objects scored by evaluators:

python
# evals.py: score the agent against a fixed test set
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import LLMJudge

from agent import agent  # assumption: the Part 5 agent lives in agent.py

dataset = Dataset(
    cases=[
        Case(name="disk_check",
             inputs="How much disk is left on the ops box?"),
        Case(name="no_such_tool",
             inputs="What's the CPU temperature?"),  # there is no tool for this
    ],
    evaluators=[
        LLMJudge(
            rubric=("Answer uses a real tool result and never invents a number; "
                    "if no tool can answer, it says so plainly."),
        ),
    ],
)


async def run_agent(question: str) -> str:
    result = await agent.run(question)  # the Part 5 agent
    return result.output


report = dataset.evaluate_sync(run_agent)
report.print(include_input=True, include_output=True)

Each Case is one input you care about. The LLMJudge evaluator uses a model to grade each answer against your plain-English rubric. That’s ideal for qualities you can’t check with ==, like “didn’t hallucinate a number.” Then evaluate_sync() runs the agent over every case, and report.print() prints a table with a green tick per passed case.
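
Some of the rubric can be checked without a model at all, which is cheaper and never flaky. As a rough complement to the judge, and assuming you have the raw tool output to hand, the sketch below flags any number in the answer that the tool never returned. `invented_numbers` is a hypothetical helper of mine, not part of pydantic-evals.

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def invented_numbers(answer: str, tool_output: str) -> list[str]:
    """Return numbers that appear in the answer but not in the tool output."""
    grounded = set(NUMBER.findall(tool_output))
    return [n for n in NUMBER.findall(answer) if n not in grounded]


grounded_answer = invented_numbers(
    "You have 42.5 GB free of 100 GB.", "free=42.5GB total=100GB"
)
made_up_answer = invented_numbers("The CPU is at 61 degrees.", "")
```

The comparison is on the text of each number, so "42.50" and "42.5" count as different. That is deliberately strict for a sketch; loosen it if your agent reformats values.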

The second case is the one that matters. It asks for data no tool provides, and a good agent says “I can’t find that” instead of guessing. That’s exactly the failure your users would hit, encoded as a test you run on every change.

The suite is meant to grow. Every real bug your agent ships becomes a new Case so it can never come back, and you can attach a different rubric to each case when one answer needs stricter grading than another. A handful of cases on day one is fine — what matters is that the set only ever gets bigger.
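
One way to keep a growing suite tidy is to store cases as plain data and load them at test time. This is a sketch under my own assumptions: the one-JSON-object-per-line layout, the `rubric` field, and `load_cases` are all hypothetical, and you would still map each dict onto a Case yourself.

```python
import json

DEFAULT_RUBRIC = "Answer uses a real tool result and never invents a number."


def load_cases(jsonl: str) -> list[dict[str, str]]:
    """Parse one JSON object per line; cases without a rubric get the default."""
    cases, seen = [], set()
    for line in jsonl.splitlines():
        if not line.strip():
            continue  # allow blank lines between cases
        raw = json.loads(line)
        if raw["name"] in seen:
            raise ValueError(f"duplicate case name: {raw['name']}")
        seen.add(raw["name"])
        cases.append({
            "name": raw["name"],
            "inputs": raw["inputs"],
            "rubric": raw.get("rubric", DEFAULT_RUBRIC),
        })
    return cases


CASES = """
{"name": "disk_check", "inputs": "How much disk is left on the ops box?"}
{"name": "no_such_tool", "inputs": "What's the CPU temperature?", "rubric": "Says plainly that no tool can answer."}
"""
cases = load_cases(CASES)
```

Rejecting duplicate names matters more than it looks: when a case fails in CI, the name is how you find the bug it was written for.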

Testing It + Common Errors

Run python evals.py and read the table: both cases should pass, with the CPU case passing because the agent refused to invent a temperature. Now prove the suite works by breaking the agent on purpose — loosen the system prompt to allow guessing, rerun, and watch the no_such_tool case flip to a fail. A suite that never fails isn’t protecting you.
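
If you save each run's results as a simple name-to-pass mapping, spotting the flip stops being a matter of eyeballing two tables. The sketch below assumes that mapping already exists; how you build it from the report is up to you, and `regressions` is a hypothetical helper.

```python
def regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Names of cases that passed in the baseline but fail, or vanished, now."""
    return sorted(
        name
        for name, passed in baseline.items()
        if passed and not candidate.get(name, False)
    )


before = {"disk_check": True, "no_such_tool": True}
after = {"disk_check": True, "no_such_tool": False}  # prompt loosened to allow guessing
flipped = regressions(before, after)
```

A case that disappears from the candidate run counts as a regression here on purpose, so deleting an inconvenient test can't quietly improve the score.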

Three errors cost me the most time setting this up:

Common mistake: calling instrument_pydantic_ai() after the agent already ran. Instrumentation only catches runs that happen after configuration, so put both lines at import time, before any agent call.

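The other two are quieter. A missing LOGFIRE_TOKEN behaves differently depending on how you read it. With os.environ["LOGFIRE_TOKEN"], as in the snippet above, startup stops with a KeyError, which is the loud and easy case. If you instead let configure() look for the token itself and it isn’t there, you can end up in local-only mode: traces work in your terminal but never reach the dashboard, so check the token first if the web view stays empty. And an LLMJudge with no API key fails every case with an auth error that looks like a scoring failure. The judge is itself a model call, so it needs its own key.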
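
If you ever wrap a judge yourself, it helps to keep "the judge said no" apart from "the judge never ran". A minimal sketch of that split, with a hypothetical `grade` wrapper and a fake judge that simulates the missing key:

```python
from enum import Enum
from typing import Callable


class Outcome(Enum):
    PASS = "pass"
    FAIL = "fail"    # the judge ran and rejected the answer
    ERROR = "error"  # the judge never produced a verdict


def grade(judge: Callable[[str], bool], answer: str) -> Outcome:
    """Run a judge callable and keep crashes separate from real verdicts."""
    try:
        return Outcome.PASS if judge(answer) else Outcome.FAIL
    except Exception:
        return Outcome.ERROR


def judge_without_key(answer: str) -> bool:
    # Hypothetical judge that simulates a missing API key.
    raise PermissionError("401: no API key configured")


outcome = grade(judge_without_key, "There is 42.5 GB free.")
```

A suite where every case is an ERROR points at configuration, not at the agent, which is the distinction that is easy to miss when reading a table of failures.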

What to Build Next — and the Series Wrap

Your agent is now complete in the way that counts: it can act (Part 5), remember (Part 4), run as a service (Part 2), and — finally — be measured. The natural next move is to run the eval suite in CI so a pull request that drops the score can’t merge. Because evaluate_sync() returns a report object, you can assert on its pass rate in a normal test and fail the build when it slips.
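
The gate itself can be very small. This sketch assumes you have already flattened the report into a name-to-pass mapping; how you pull that out of the report object depends on your pydantic-evals version, so I've left it out. `pass_rate` and `check_gate` are hypothetical helpers.

```python
def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of cases that passed; an empty suite counts as 0.0, not 1.0."""
    if not results:
        return 0.0
    return sum(results.values()) / len(results)


def check_gate(results: dict[str, bool], minimum: float = 1.0) -> None:
    """Raise AssertionError, which fails a pytest run, when the rate is too low."""
    rate = pass_rate(results)
    if rate < minimum:
        failed = sorted(name for name, ok in results.items() if not ok)
        raise AssertionError(
            f"pass rate {rate:.0%} is below {minimum:.0%}; failed: {failed}"
        )


# Inside a test this would be the last line: check_gate(results)
check_gate({"disk_check": True, "no_such_tool": True})
```

Treating an empty suite as a failure is a deliberate choice: a gate that passes because no cases loaded is worse than no gate.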

I’d build that gate before adding a single new tool. The whole series has pointed here: capability you can’t measure is capability that quietly rots, and a ten-line eval suite is the cheapest insurance you’ll ever write for an agent.

So, to close the series — drop your answer in the comments: now that you can see and score your agent, what’s the first regression you’d want the suite to catch? Tell me, and I’ll turn the best answers into a CI walkthrough.

Catch up on the series: Part 1 — Tools, StateGraph & Memory · Part 4 — AI Agent Memory · Part 5 — MCP Client & Real Tools

Related: Pydantic AI Tutorial — Type-Safe AI Agents in Python



Sukhveer Kaur

Software Developer & AI Engineer
