Series: Agentic AI in Python — Zero to Production
This is Part 7, the finale. The story so far:
- Part 5: Real tools via an MCP client — the agent can act
- Part 6: Observability & evals — you can see and score it
New here? You need the eval suite from Part 6 and a GitHub repo. This post turns that suite into a gate.
In Part 6 you built an eval suite that scores your agent against a fixed test set. Here’s the uncomfortable truth: a green eval run on your laptop protects no one. You have to remember to run it, and the one time you’re in a hurry is the one time a regression ships. The fix is to make the machine run it for you. By the end of this post, your AI agent evals will run in CI, and a pull request that lowers your agent’s score will not be able to merge.
This is the last piece of the series, and it’s the cheapest insurance you’ll write. Ten lines of config turn “I hope someone ran the evals” into “the build is red, so it didn’t.”
- The eval suite from Part 6 — this post turns it into a CI gate
- A GitHub repo for your agent, with permission to add Actions and branch rules
- Basic familiarity with pytest and YAML workflows. New to either? Read the pytest and CI primer.
- A test that asserts on the pass rate turns your eval suite into a pass/fail signal CI understands.
- GitHub Actions on pull_request runs that test on every change, automatically.
- A branch protection rule makes the check required; that’s what actually blocks the merge.
What We’re Building
We’re wiring your existing eval suite into a gate on your repository. The goal is simple: when a pull request lowers the agent’s score, the build fails and the merge button locks.
The diagram shows the whole loop: a PR triggers a runner, the runner runs your evals, and the resulting pass rate decides whether the merge is allowed. Nothing here is agent-specific magic — it’s the same pattern teams use for unit tests, pointed at your eval score instead.
Why CI and not a pre-commit hook? A hook runs on your machine, so it’s easy to skip with --no-verify and it never runs on a teammate’s change. CI runs on a server you don’t control, on every PR, for everyone. That’s the difference between a habit and a guarantee — and with agents, where a one-line prompt tweak can quietly tank quality, you want the guarantee.
Prerequisites: What You Need
You need three things before the gate works, and each maps to one step below.
- The Part 6 eval suite: an evals.py with a Dataset of Cases and a run_agent function.
- A GitHub repository for the agent, with the code on a branch.
- Your model API keys (OPENAI_API_KEY, and LOGFIRE_TOKEN if you trace), ready to paste as secrets, never committed.
If you don’t have the eval suite yet, build it first in Part 6 — this post assumes a dataset that already runs locally.
Step 1 — Turn Evals into a Test That Fails
CI doesn’t read pretty tables; it reads exit codes. So the first job is to express “the agent is good enough” as an assertion that throws when the score drops. The EvaluationReport from Part 6 makes this a two-liner: call averages() to get the aggregate, and read .assertions — the fraction of your assertion checks that passed, from 0 to 1.
    # test_evals.py: fails the build when the agent regresses
    from evals import dataset, run_agent  # reuse your Part 6 suite

    PASS_THRESHOLD = 1.0  # require every assertion to pass


    def test_agent_quality():
        report = dataset.evaluate_sync(run_agent)
        rate = report.averages().assertions  # aggregate pass rate, 0.0–1.0
        assert rate >= PASS_THRESHOLD, f"Eval pass rate dropped to {rate:.0%}"
Run pytest test_evals.py and it passes when every case passes, and fails the moment one regresses — exactly the signal CI needs. Naming the file test_evals.py lets pytest discover it automatically, so you don’t need a custom runner.
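If the exit-code contract feels abstract, here is a minimal sketch of it using only the standard library. It runs a child Python process that fails an assertion and reads the return code, which is the same signal a CI runner looks at; pytest follows the same convention and exits non-zero when any test fails. The helper name `exit_code_for` is made up for illustration and is not part of the Part 6 suite.

```python
import subprocess
import sys


def exit_code_for(snippet: str) -> int:
    """Run a Python snippet in a child process and return its exit code."""
    result = subprocess.run(
        [sys.executable, "-c", snippet],
        capture_output=True,  # keep the child's traceback out of our own output
    )
    return result.returncode


# The same comparison the eval test makes, once passing and once failing.
passing = exit_code_for("rate = 1.0; assert rate >= 1.0")
failing = exit_code_for("rate = 0.8; assert rate >= 1.0")

print(passing)  # 0: the runner treats the step as green
print(failing)  # non-zero: the runner marks the step red
```

Nothing about the agent leaks into this contract. The runner never sees your scores, only whether the process ended cleanly.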
The threshold is a policy choice, not a constant. Set it to 1.0 for a small, high-stakes suite where every case must hold, or to something like 0.9 once your set is large enough that one flaky LLM-judged case shouldn’t block the whole team. Start strict and loosen only when a real false failure forces you to.
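To make the policy concrete, here is a small sketch of how the same run can pass or fail depending on the threshold. It works on a plain list of booleans rather than the real report object, and the names `pass_rate` and `gate` plus the 20-check run are assumptions for illustration.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of assertion checks that passed, from 0.0 to 1.0."""
    if not results:
        # An empty suite should never be allowed to pass the gate silently.
        raise ValueError("no eval results to score")
    return sum(results) / len(results)


def gate(results: list[bool], threshold: float) -> bool:
    """Return True when the pass rate meets the threshold."""
    return pass_rate(results) >= threshold


# Hypothetical run: 20 assertion checks, one of them fails.
results = [True] * 19 + [False]

print(f"{pass_rate(results):.0%}")   # 95%
print(gate(results, threshold=1.0))  # False: the strict gate blocks the merge
print(gate(results, threshold=0.9))  # True: the looser gate tolerates one miss
```

The trade-off is visible in the last two lines: the looser threshold keeps one flaky case from blocking the team, but it also lets one real regression through.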
Step 2 — The GitHub Actions Workflow
Now we make GitHub run that test on every pull request. Create one file, .github/workflows/evals.yml, and the runner does the rest.
    name: agent-evals
    on:
      pull_request:  # run the gate on every PR
    jobs:
      evals:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - uses: actions/setup-python@v5
            with:
              python-version: "3.12"
          - run: pip install -r requirements.txt
          - run: pytest test_evals.py -q
            env:  # the runner has no keys unless you pass them
              OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
The on: pull_request trigger is the whole point — the suite runs before the merge, not after. Each step is ordinary CI: check out the code, install Python 3.12, install dependencies, then run the test. The env block hands your secret to the process; without it, the model call fails in CI even though it works on your machine.
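A missing key usually surfaces as an opaque authentication error deep inside the model client. One option is a small pre-flight check that fails with a readable message instead. The sketch below is an assumption about how you might do it, not something from Part 6: `missing_secrets` and `require_secrets` are hypothetical helpers. It treats blank values as missing because a secret that isn't available to the workflow expands to an empty string rather than leaving the variable unset.

```python
from __future__ import annotations

import os
from collections.abc import Mapping


def missing_secrets(names: list[str], env: Mapping[str, str] | None = None) -> list[str]:
    """Return the names that are unset or blank in the environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name, "").strip()]


def require_secrets(names: list[str], env: Mapping[str, str] | None = None) -> None:
    """Raise a readable error when any required secret is missing."""
    missing = missing_secrets(names, env)
    if missing:
        raise RuntimeError(
            "Missing or empty environment variables: "
            + ", ".join(missing)
            + ". Check the workflow's env block and the repository secrets."
        )


# Simulated CI environment where the secret was never configured.
fake_ci_env = {"OPENAI_API_KEY": ""}
print(missing_secrets(["OPENAI_API_KEY"], fake_ci_env))  # ['OPENAI_API_KEY']
```

Calling `require_secrets(["OPENAI_API_KEY"])` at the top of test_evals.py would turn a confusing client error into a one-line explanation in the CI log.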
Step 3 — Add Secrets and Make the Check Required
Two small settings turn a workflow that runs into a gate that blocks. First, give the runner its keys. In your repo, go to Settings → Secrets and variables → Actions → New repository secret, and add OPENAI_API_KEY (and LOGFIRE_TOKEN if you trace). These are encrypted and exposed only to your workflows.
Second — and this is the step people skip — make the check required. A green or red check is just advisory until you enforce it. Add a branch protection rule on your default branch:
- Open Settings → Branches → Add branch ruleset (or a classic branch protection rule).
- Target your default branch (usually main).
- Enable Require status checks to pass before merging, then select the evals job.
One real-world caveat: GitHub does not expose secrets to pull requests opened from forks. If outside contributors send PRs, their eval job can’t reach your keys, so plan to run the gate on a trusted branch or via a maintainer-approved workflow for fork contributions.
Testing It + Common Errors
Don’t trust a gate you haven’t seen fail. The best test is to break the agent on purpose: open a PR that loosens the system prompt to allow guessing, and watch the no_such_tool-style case flip to a fail. The evals check should go red and the merge button should lock. Revert, and it goes green. A gate that never fails isn’t protecting anything.
Three errors cost me the most time the first time I set this up:
- Passes locally, fails in CI. Almost always a missing secret. Confirm the secret’s name matches the env key, character for character.
- Flaky failures. An LLM judge isn’t fully deterministic. Set temperature=0 in your eval runs and, for larger suites, drop the threshold slightly so one borderline case doesn’t block everyone.
- Surprise cost. Every run makes real model calls. Keep the case set small and focused; if it grows, scope the pull_request trigger to paths that touch agent code so docs-only PRs skip the spend.
What to Build Next
You now have the safety net the whole series was building toward. The natural next layer is guardrails — runtime checks that validate an agent’s output during a request, not just in CI, so a bad answer is caught before it reaches a user. I’d add input validation and an output schema check first, because they catch the failures evals can’t predict.
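As a rough idea of what an output schema check could look like, here is a standard-library sketch. The contract itself (a JSON object with an `answer` string and a `sources` list) and the name `validate_output` are assumptions for illustration; in a real project a Pydantic model would do this job more thoroughly.

```python
import json

# Hypothetical output contract: the agent must return JSON with these fields.
REQUIRED_FIELDS = {"answer": str, "sources": list}


def validate_output(raw: str) -> dict:
    """Parse the agent's raw reply and reject anything off-contract."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"agent output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("agent output must be a JSON object")
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            raise ValueError(
                f"field {field!r} is missing or not a {expected_type.__name__}"
            )
    return data


good = validate_output('{"answer": "42", "sources": []}')
print(good["answer"])  # 42

try:
    validate_output('{"answer": 42}')  # wrong type, and no sources field
except ValueError as err:
    print(f"rejected: {err}")
```

The point of running this at request time is that it catches a malformed reply on inputs your eval set never covered.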
After that, two cheap wins: post the eval pass rate as a PR comment so reviewers see the score inline, and add a small cost budget assertion to the same test so a change that doubles token usage also fails the build. Both reuse the report object you already have.
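Both ideas fit in a few lines. The sketch below uses a small dataclass as a stand-in for whatever numbers you pull from your report, so `RunSummary`, its fields, and the token figures are all hypothetical; how you read token usage depends on your own tracing setup.

```python
from dataclasses import dataclass


@dataclass
class RunSummary:
    """Hypothetical stand-in for values taken from your eval report."""

    pass_rate: float   # 0.0 to 1.0
    total_tokens: int  # tokens used across the whole eval run


def format_pr_comment(summary: RunSummary, threshold: float) -> str:
    """Build the text a CI step could post on the pull request."""
    status = "PASS" if summary.pass_rate >= threshold else "FAIL"
    return (
        f"Agent evals: {status}\n"
        f"Pass rate: {summary.pass_rate:.0%} (threshold {threshold:.0%})\n"
        f"Tokens used: {summary.total_tokens:,}"
    )


def check_token_budget(summary: RunSummary, budget: int) -> None:
    """Fail the test when the run spends more tokens than allowed."""
    assert summary.total_tokens <= budget, (
        f"Eval run used {summary.total_tokens:,} tokens, "
        f"over the budget of {budget:,}"
    )


summary = RunSummary(pass_rate=0.95, total_tokens=48_200)  # made-up numbers
print(format_pr_comment(summary, threshold=0.9))
check_token_budget(summary, budget=60_000)  # passes; an over-budget run raises
```

Keeping the budget check in the same test as the quality assertion means one required check covers both, so there is no second branch rule to maintain.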
Conclusion
That’s the series. Your agent can act, remember, run as a service, be measured — and now it’s guarded, because the one check that matters runs without you remembering to run it. The gate is ten lines of config, but it changes the default: regressions are caught before review, not after release. Capability you can’t protect is capability that quietly rots, and now yours can’t.
I’ll leave you with the question that decides how strict your gate should be: what’s the one regression you’d never want to ship — a hallucinated number, a wrong tool call, a refusal that should’ve been an answer? Tell me in the comments, and encode it as your first required case.
The full series — Agentic AI in Python: Zero to Production:
- Part 1 — Tools, StateGraph & Memory
- Part 2 — FastAPI, Docker & Deploy
- Part 3 — Multi-Agent Systems
- Part 4 — AI Agent Memory
- Part 5 — MCP Client & Real Tools
- Part 6 — Observability & Evals
- Part 7 — Evals in CI — you’re here
