
AI Agent Observability & Evals in Python (Part 6)

Add AI agent observability and evals in Python — trace every agent run with Pydantic Logfire and score it against a test set using pydantic-evals.


Sukhveer Kaur

Published June 18, 2026 · Updated July 6, 2026

7 min read


🧰 New here? Set up your environment first · ~5 min

Install Python 3.11+ — confirm with python3 --version.
Create and activate a virtual environment: python3 -m venv .venv then source .venv/bin/activate (Windows: .venv\Scripts\activate). venv, pip & uv primer →
Install the packages this tutorial lists: pip install -U pip <packages>.
Put your LLM API key in a .env file and never commit it. API key + .env primer →

Full walkthrough → Environment Setup primer

🟡 Intermediate · ⏱️ 25 min · Stack: Python 3.11+, Pydantic Logfire, pydantic-evals

Series: Agentic AI in Python — Zero to Production

This is Part 6 — adding AI agent observability and evals. The story so far:

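Part 1: A local LangGraph agent with tools and a checkpointer

Part 2: FastAPI, Docker, and deployment, so the agent runs as a service

Part 3: Multi-agent systems

Part 4: Long-term memory across threads

Part 5: Real tools via an MCP client, so the agent can now act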

New here? You’ll need the Part 5 agent that can call tools — this post puts a microscope and a scorecard on it.

By the end of Part 5 your agent could think, remember, and act. There’s one thing it still can’t do: tell you whether any of that is actually good. Right now you find out it broke the way everyone finds out — a user complains. This part fixes that with two tools that work together. AI agent observability shows you exactly what every run did. Evals score the agent against a fixed test set, so a bad change can’t sneak past you.

I put these last on purpose. Adding capability feels like progress; measuring it feels like overhead. That holds right up until the day a prompt tweak quietly breaks tool-calling and you have no way to notice. By the end of this post you’ll trace every run in a real dashboard and run a pass/fail suite you can wire into CI. Three short steps. First, what these two words actually mean.

✅ Before you start

The tool-calling agent from Part 5 — this post instruments and scores it
You’ve shipped at least Part 1’s agent and can run it end to end
An LLM API key (the eval judge is itself a model call and needs one)

🎯 Key takeaways

Observability answers “what did my agent do?”; evals answer “is it any good?” — you need both, and neither changes the agent, only instruments it.
Two lines of Logfire (configure + instrument) turn every run, model call, and tool call into a trace; LangGraph agents feed the same dashboard through LangChain's OpenTelemetry export.
Watch three numbers per run — latency, cost (from token counts), and which tools were called — to turn “it feels slow” into a specific span you can fix.
Score the agent with pydantic-evals: a Dataset of Cases graded by an LLMJudge rubric, run on every change and wired into CI so a regression can’t merge.

What We’re Building

AI agent observability and evals answer two different questions, and you need both. Observability (seeing the internal state of a running system from the outside) answers “what did my agent just do?” — every model call, tool call, token, and millisecond. Evals (evaluations — scoring outputs against expected behaviour) answer “is my agent any good?” They check a set of cases you care about, not just the one you happened to test by hand.

The diagram shows the split. The same agent feeds two pipelines. One emits OpenTelemetry spans (timed records of each step) to a dashboard, so you can watch live traffic. The other runs the agent over test cases and scores the results. You’re not changing the agent — you’re putting instruments around it. The observability half is the “see it” lane; the evals half is the “score it” lane.
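To make "span" concrete, here is a minimal sketch of the idea in plain Python: a timed record for each step, nested under the step that started it. This is not Logfire's or OpenTelemetry's API. `Span`, `Tracer`, and `render` are hypothetical names used only for illustration.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Span:
    """One timed step in a run. Hypothetical stand-in for an OpenTelemetry span."""
    name: str
    start: float
    end: float = 0.0
    children: list["Span"] = field(default_factory=list)

    @property
    def duration_ms(self) -> float:
        return (self.end - self.start) * 1000


class Tracer:
    """Collects spans into a tree by tracking which span is currently open."""

    def __init__(self) -> None:
        self.roots: list[Span] = []
        self._stack: list[Span] = []

    @contextmanager
    def span(self, name: str):
        current = Span(name=name, start=time.perf_counter())
        # Attach to the open span if there is one, otherwise start a new tree.
        (self._stack[-1].children if self._stack else self.roots).append(current)
        self._stack.append(current)
        try:
            yield current
        finally:
            current.end = time.perf_counter()
            self._stack.pop()


def render(span: Span, depth: int = 0) -> list[str]:
    """Flatten the tree into indented lines, one per span."""
    lines = [f"{'  ' * depth}{span.name}"]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


tracer = Tracer()
with tracer.span("agent run"):
    with tracer.span("model call"):
        pass
    with tracer.span("tool: disk_usage"):
        pass

print("\n".join(render(tracer.roots[0])))
```

The printed tree has the same shape a trace view shows: one run at the top, with the model call and the tool call nested beneath it. A real tracing library records the same structure and also exports it to a backend.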

What You Need

Everything here builds on the agent from Part 5 — you don’t write any new agent logic, just wrap what you have.

The Part 5 agent that can call MCP tools (LangGraph or Pydantic AI)
pip install -U logfire pydantic-evals (both current in June 2026)
A free Pydantic Logfire account for a write token, or any OpenTelemetry backend
An API key for the model your judge will use (the evals step needs one)

If you’d rather not sign up for anything, Logfire is built on OpenTelemetry, so you can point the same spans at a self-hosted backend like Langfuse instead — the instrumentation code below doesn’t change.

💡 Want to run the evals without an API key?

The LLMJudge below is itself a model call, so a growing suite costs real money. To follow the evals half of this post for free, point both the agent and the judge at a local model with Ollama: the local-LLM eval walkthrough shows the one base_url change that removes the API-key requirement entirely.

Step 1: Trace Every Run with Logfire

Here’s the early win: two lines turn an invisible agent into one you can watch. Pydantic Logfire is an observability platform built on OpenTelemetry, and Pydantic AI has native support for it. You configure it once, at startup:

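```python
# observability.py: turn tracing on once, at import time
import os

import logfire

# Read the write token from the environment; never hard-code it.
# os.environ[...] raises KeyError if LOGFIRE_TOKEN is unset.
logfire.configure(token=os.environ["LOGFIRE_TOKEN"])
logfire.instrument_pydantic_ai()   # every agent run is now a trace
```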

That’s it. logfire.configure() sets up the exporter, and instrument_pydantic_ai() makes every agent run, model call, and tool call show up as a span — no per-call logging, no print statements. Run your Part 5 agent once and the dashboard fills with a tree you can expand: the run, the model’s thinking, the disk_usage tool call, the result.

On the LangGraph side there’s no separate API to learn. LangChain already emits OpenTelemetry traces, so set LANGSMITH_OTEL_ENABLED=true and Logfire (or any OTel backend) picks up the same spans. One dashboard, whichever framework you built on.

The token lives in an environment variable, never in the file — the same rule we used for the MCP server’s auth in Part 5.

Step 2: AI Agent Observability in Practice

A trace is only useful if you know what to look at. This is where AI agent observability earns its keep: three numbers tell you most of what you need on any given run.

Latency — how long the run took, and which step ate it. A slow agent is almost always one slow tool call, and the span tree points straight at it.
Cost — Logfire reads token counts off each model call and totals them. You see the price per run, instead of guessing at the end of the month.
Tool calls — every tool the model chose, with its arguments and result, so you can confirm it called disk_usage and not something it invented.
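The three numbers above are simple arithmetic over the spans of one run. A rough sketch in plain Python, assuming spans run one after another and using made-up prices: `SpanRecord`, `summarize`, and both price constants are hypothetical, so substitute your model's real per-token rates.

```python
from dataclasses import dataclass


@dataclass
class SpanRecord:
    """Hypothetical flattened view of one span from a trace."""
    name: str
    kind: str              # "model" or "tool"
    duration_ms: float
    input_tokens: int = 0
    output_tokens: int = 0


# Assumed prices in USD per million tokens; not real rates for any model.
PRICE_PER_M_INPUT = 1.00
PRICE_PER_M_OUTPUT = 4.00


def summarize(spans: list[SpanRecord]) -> dict:
    """Reduce one run's spans to latency, cost, and the tools that were called."""
    slowest = max(spans, key=lambda s: s.duration_ms)
    cost = sum(
        s.input_tokens * PRICE_PER_M_INPUT + s.output_tokens * PRICE_PER_M_OUTPUT
        for s in spans
    ) / 1_000_000
    return {
        "latency_ms": sum(s.duration_ms for s in spans),  # assumes sequential spans
        "slowest": slowest.name,
        "cost_usd": cost,
        "tools": [s.name for s in spans if s.kind == "tool"],
    }


run = [
    SpanRecord("model: plan", "model", 600, input_tokens=1000, output_tokens=100),
    SpanRecord("disk_usage", "tool", 6500),
    SpanRecord("model: answer", "model", 900, input_tokens=1200, output_tokens=200),
]
summary = summarize(run)
print(summary["latency_ms"], summary["slowest"], summary["tools"])
```

In this invented run the total is eight seconds, and the `slowest` field names the tool call rather than either model call, which is the kind of answer a span tree gives you at a glance.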

In my own runs, a healthy tool-backed answer round-trips in roughly 1–2 seconds and costs a fraction of a cent. When a run suddenly takes eight seconds, the trace shows the cause: a retry against an unreachable server, not a slow model. The point of tracing is to turn “it feels slow” into a specific span you can fix. That alone is worth the two lines of setup — but it still doesn’t tell you whether the answers are right. For that, you need evals.

Step 3: Score the Agent with pydantic-evals

Observability shows you one run at a time. Evals run a whole batch of cases and grade each one, so you can change a prompt and immediately see whether anything regressed (got worse than before). Pydantic Evals models this as a Dataset of Case objects scored by evaluators:

```python
# evals.py: score the agent against a fixed test set
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import LLMJudge

# Assumption: the Part 5 agent is importable from a module named agent.py.
from agent import agent

dataset = Dataset(
    cases=[
        Case(name="disk_check",
             inputs="How much disk is left on the ops box?"),
        Case(name="no_such_tool",
             inputs="What's the CPU temperature?"),   # there is no tool for this
    ],
    evaluators=[
        # Dataset-level evaluators run against every case.
        LLMJudge(
            rubric=("Answer uses a real tool result and never invents a number; "
                    "if no tool can answer, it says so plainly."),
        ),
    ],
)


async def run_agent(question: str) -> str:
    """Task under test: takes a case's inputs, returns the agent's answer."""
    result = await agent.run(question)   # the Part 5 agent
    return result.output


if __name__ == "__main__":
    report = dataset.evaluate_sync(run_agent)
    report.print(include_input=True, include_output=True)
```

Each Case is one input you care about. The LLMJudge evaluator uses a model to grade each answer against your plain-English rubric. That’s ideal for qualities you can’t check with ==, like “didn’t hallucinate a number.” Then evaluate_sync() runs the agent over every case, and report.print() prints a table with a green tick per passed case.

The second case is the one that matters. It asks for data no tool provides, and a good agent says “I can’t find that” instead of guessing. That’s exactly the failure your users would hit, encoded as a test you run on every change.
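The "never invents a number" half of the rubric can also be approximated without a model call, which is useful as a cheap first filter before the judge runs. The sketch below is plain Python, not part of pydantic-evals. `invented_numbers` is a hypothetical helper, and it assumes you can get the raw tool results for a run as strings.

```python
import re

# Matches integers and simple decimals such as 42 or 42.5.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def invented_numbers(answer: str, tool_results: list[str]) -> list[str]:
    """Return every number in the answer that appears in no tool result."""
    grounded: set[str] = set()
    for result in tool_results:
        grounded.update(NUMBER.findall(result))
    return [n for n in NUMBER.findall(answer) if n not in grounded]


# A grounded answer: the figure comes straight from the tool output.
print(invented_numbers("About 42.5 GB is free.", ["free_gb=42.5"]))

# The no_such_tool case: no tool ran, so any number is invented.
print(invented_numbers("The CPU is at 61 degrees.", []))
```

The first call prints an empty list and the second prints `['61']`. A string match like this misses rounding and unit conversions, so treat it as a complement to the LLMJudge rubric, not a replacement for it.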

The suite is meant to grow. Every real bug your agent ships becomes a new Case so it can never come back, and you can attach a different rubric to each case when one answer needs stricter grading than another. A handful of cases on day one is fine — what matters is that the set only ever gets bigger.

💡 Tip

You can’t improve what you can’t see. Add tracing before you tune prompts — most “the agent is dumb” problems turn out to be a tool quietly returning junk, which only a trace reveals.

Testing It + Common Errors

Run python evals.py and read the table: both cases should pass, with the CPU case passing because the agent refused to invent a temperature. Now prove the suite works by breaking the agent on purpose — loosen the system prompt to allow guessing, rerun, and watch the no_such_tool case flip to a fail. A suite that never fails isn’t protecting you.

Three errors cost me the most time setting this up:

Common mistake: calling instrument_pydantic_ai() after the agent already ran. Instrumentation only catches runs that happen after configuration, so put both lines at import time, before any agent call.

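The other two are quieter. The first is a missing LOGFIRE_TOKEN. With the Step 1 code, os.environ["LOGFIRE_TOKEN"] raises a KeyError at startup, which is at least loud. If you instead let configure() read the variable itself and pass send_to_logfire="if-token-present", a missing token means spans are never sent: traces still show up in your terminal but never reach the dashboard, so check the token first if the web view stays empty. The second is an LLMJudge with no API key, which fails every case with an auth error that looks like a scoring failure. The judge is itself a model call, so it needs a key for whichever provider it uses.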

What to Build Next, and the Series Wrap

Your agent is now complete in the way that counts: it can act (Part 5), remember (Part 4), run as a service (Part 2), and — finally — be measured. The natural next move is to run the eval suite in CI so a pull request that drops the score can’t merge. Because evaluate_sync() returns a report object, you can assert on its pass rate in a normal test and fail the build when it slips.
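The gate itself is a few lines once you have one pass/fail value per case. The sketch below covers only that part, in plain Python. How you read per-case results off the report depends on your pydantic-evals version and is not shown, so the `results` mapping here is a hand-written stand-in. `pass_rate` and `check_gate` are hypothetical names.

```python
def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of cases that passed; an empty suite is treated as an error."""
    if not results:
        raise ValueError("no eval results to score")
    return sum(results.values()) / len(results)


def check_gate(results: dict[str, bool], minimum: float = 1.0) -> float:
    """Raise AssertionError, naming the failed cases, if the rate is below minimum."""
    rate = pass_rate(results)
    if rate < minimum:
        failed = sorted(name for name, ok in results.items() if not ok)
        raise AssertionError(
            f"eval pass rate {rate:.0%} is below {minimum:.0%}; "
            f"failed: {', '.join(failed)}"
        )
    return rate


# Stand-in for results extracted from an evaluation report.
results = {"disk_check": True, "no_such_tool": True}
print(check_gate(results))
```

Called from a normal test function, `check_gate` fails the test run when any case flips, and the error message names the case. Lowering `minimum` is a deliberate decision you would make in review, not something that happens silently.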

I’d build that gate before adding a single new tool. The whole series has pointed here: capability you can’t measure is capability that quietly rots, and a ten-line eval suite is the cheapest insurance you’ll ever write for an agent.

So, to close the series — drop your answer in the comments: now that you can see and score your agent, what’s the first regression you’d want the suite to catch? Tell me, and I’ll turn the best answers into a CI walkthrough.

The full series — Agentic AI in Python: Zero to Production:

Part 1 — Tools, StateGraph & Memory
Part 2 — FastAPI, Docker & Deploy
Part 3 — Multi-Agent Systems
Part 4 — AI Agent Memory
Part 5 — MCP Client & Real Tools
Part 6 — Observability & Evals — you’re here

🧭 Where to go from here

Need the agent under test? Part 5 gives it real tools to instrument.
Want type-safe agents? The Pydantic AI tutorial pairs naturally with pydantic-evals.
Newer to agents? The build-from-scratch series explains every layer this series builds on.

Frequently asked questions

What's the difference between observability and evals?

Observability shows what one run did (traces, cost, tool calls). Evals score the agent across a fixed test set so you catch regressions. Use both — they answer different questions.

Do I have to sign up for Logfire?

No. It's built on OpenTelemetry, so you can point the same spans at any OTel backend, including a self-hosted Langfuse. The instrumentation code is identical.

Why are my traces missing from the dashboard?

Either instrument_pydantic_ai() was called after the agent ran (configure and instrument at import time, before any run), or LOGFIRE_TOKEN is unset and Logfire is running without a token, so spans never leave your machine.

How many eval cases do I need to start?

A handful is fine. What matters is that the set only grows — every real bug becomes a new Case so it can never silently regress.


