Frequently asked questions

How do I run AI agent evals in CI?

Wrap your pydantic-evals suite in a test that asserts on the pass rate (report.averages().assertions), then run it with pytest from a GitHub Actions workflow triggered on pull_request. A failing assertion fails the build.

How do I make the eval check required before merging?

Add a branch protection rule on your default branch and mark the workflow's job as a required status check. GitHub then blocks the merge button until the eval job passes.

Why do my evals pass locally but fail in CI?

Almost always missing secrets. The CI runner has no API keys unless you add them under Settings → Secrets and variables → Actions and pass them as env in the workflow. Note that secrets are not exposed to pull requests from forks.

Won't running evals on every PR get expensive?

It can, because each case is a real model call. Keep the case set small, set temperature to 0 for repeatability, and if needed run the gate only on PRs touching agent code or labelled for review.

AI Agent Evals in CI: Block Bad PRs with GitHub Actions

Series: Agentic AI in Python — Zero to Production
This is Part 7, the finale. The story so far:
Part 5: Real tools via an MCP client — the agent can act
Part 6: Observability & evals — you can see and score it
New here? You need the eval suite from Part 6 and a GitHub repo. This post turns that suite into a gate.

In Part 6 you built an eval suite that scores your agent against a fixed test set. Here’s the uncomfortable truth: a green eval run on your laptop protects no one. You have to remember to run it, and the one time you’re in a hurry is the one time a regression ships. The fix is to make the machine run it for you. By the end of this post, running AI agent evals in CI will mean that a pull request that lowers your agent’s score cannot merge.

This is the last piece of the series, and it’s the cheapest insurance you’ll write. One test function and a short workflow file turn “I hope someone ran the evals” into “the build is red, so it didn’t.”

🟡 Intermediate · ⏱️ 20 min · Stack: Python 3.11+, pytest, GitHub Actions

✅ Before you start

The eval suite from Part 6 — this post turns it into a CI gate
A GitHub repo for your agent, with permission to add Actions and branch rules
Basic familiarity with pytest and YAML workflows — new to either? Read the pytest and CI primer

🎯 Key takeaways

A test that asserts on the pass rate turns your eval suite into a pass/fail signal CI understands.
GitHub Actions on pull_request runs that test on every change, automatically.
A branch protection rule makes the check required — that’s what actually blocks the merge.

What We’re Building

We’re wiring your existing eval suite into a gate on your repository. The goal is simple: when a pull request lowers the agent’s score, the build fails and the merge button locks.

Figure: the CI eval gate. A pull request triggers a GitHub Actions runner, which runs the eval suite (the agent over your test cases plus an LLM judge). If the pass rate meets the threshold, the PR can merge; otherwise it is blocked.

The diagram shows the whole loop: a PR triggers a runner, the runner runs your evals, and the resulting pass rate decides whether the merge is allowed. Nothing here is agent-specific magic — it’s the same pattern teams use for unit tests, pointed at your eval score instead.

Why CI and not a pre-commit hook? A hook runs on your machine, so it’s easy to skip with --no-verify and it never runs on a teammate’s change. CI runs on a server you don’t control, on every PR, for everyone. That’s the difference between a habit and a guarantee — and with agents, where a one-line prompt tweak can quietly tank quality, you want the guarantee.

Prerequisites: What You Need

You need three things before the gate works, and each maps to one step below.

The Part 6 eval suite — an evals.py with a Dataset of Cases and a run_agent function.
A GitHub repository for the agent, with the code on a branch.
Your model API keys (OPENAI_API_KEY, and LOGFIRE_TOKEN if you trace) — ready to paste as secrets, never committed.

Figure: setting up the CI eval gate. Turn the eval suite into a test that asserts on the pass rate, add a GitHub Actions workflow on pull requests, store the API keys as repository secrets, and make the check required with a branch protection rule. The result: any pull request that drops the score is blocked.

If you don’t have the eval suite yet, build it first in Part 6 — this post assumes a dataset that already runs locally.

Step 1 — Turn Evals into a Test That Fails

CI doesn’t read pretty tables; it reads exit codes. So the first job is to express “the agent is good enough” as an assertion that throws when the score drops. The EvaluationReport from Part 6 makes this a two-liner: call averages() to get the aggregate, and read .assertions — the fraction of your assertion checks that passed, from 0 to 1.

python

# test_evals.py: fails the build when the agent regresses
from evals import dataset, run_agent      # reuse your Part 6 suite

PASS_THRESHOLD = 1.0                       # require every assertion to pass

def test_agent_quality():
    report = dataset.evaluate_sync(run_agent)
    averages = report.averages()           # may be None if no case produced a result
    assert averages is not None, "No eval cases completed"
    rate = averages.assertions             # aggregate pass rate, 0.0–1.0
    assert rate is not None, "No assertion results to score"
    assert rate >= PASS_THRESHOLD, f"Eval pass rate dropped to {rate:.0%}"

Run pytest test_evals.py and it passes when every case passes, and fails the moment one regresses — exactly the signal CI needs. Naming the file test_evals.py lets pytest discover it automatically, so you don’t need a custom runner.
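To see why an assertion is enough, here is a minimal sketch in plain Python (not pytest itself) showing that a failed assert in a child process produces a nonzero exit code, which is the only signal a CI runner acts on. The helper name exit_code_for is made up for this illustration.

```python
import subprocess
import sys


def exit_code_for(snippet: str) -> int:
    """Run a Python snippet in a child process and return its exit code."""
    result = subprocess.run([sys.executable, "-c", snippet], capture_output=True)
    return result.returncode


# A satisfied assertion exits cleanly; a violated one raises and exits nonzero.
passing = exit_code_for("rate = 1.0; assert rate >= 1.0")
failing = exit_code_for("rate = 0.8; assert rate >= 1.0")
print(passing, failing)
```

pytest follows the same convention: it exits with 0 when every test passes and with a nonzero code when any test fails, so the workflow step fails with it.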

The threshold is a policy choice, not a constant. Set it to 1.0 for a small, high-stakes suite where every case must hold, or to something like 0.9 once your set is large enough that one flaky LLM-judged case shouldn’t block the whole team. Start strict and loosen only when a real false failure forces you to.
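To make the trade-off concrete, here is a small plain-Python sketch with made-up numbers: 19 of 20 assertion checks pass. pass_rate and gate_passes are hypothetical helpers, and pydantic-evals may aggregate across cases differently, so treat this as an illustration of the policy rather than of the library's arithmetic.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of assertion checks that passed, from 0.0 to 1.0."""
    if not results:
        raise ValueError("no assertion results to score")
    return sum(results) / len(results)


def gate_passes(results: list[bool], threshold: float) -> bool:
    """True when the pass rate meets or exceeds the threshold."""
    return pass_rate(results) >= threshold


# Hypothetical run: one borderline case fails out of twenty.
results = [True] * 19 + [False]

strict = gate_passes(results, threshold=1.0)   # every check must hold
lenient = gate_passes(results, threshold=0.9)  # tolerates one miss in this set
print(pass_rate(results), strict, lenient)
```

With the strict threshold the single failure blocks the merge; with 0.9 the same run goes through. That is the whole policy decision in two lines.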

Step 2 — The GitHub Actions Workflow

Now we make GitHub run that test on every pull request. Create one file, .github/workflows/evals.yml, and the runner does the rest.

yaml

name: agent-evals
on:
  pull_request:                 # run the gate on every PR

jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements.txt
      - run: pytest test_evals.py -q
        env:                    # the runner has no keys unless you pass them
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

The on: pull_request trigger is the whole point — the suite runs before the merge, not after. Each step is ordinary CI: check out the code, install Python 3.12, install dependencies, then run the test. The env block hands your secret to the process; without it, the model call fails in CI even though it works on your machine.

⚠️ Common mistake: Hardcoding the API key, or forgetting the `env:` block entirely. The first leaks a credential into git history forever; the second makes every CI run fail with an auth error that looks like a code bug. Always pass keys through `secrets`.
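One way to make the missing-key case obvious is to check the environment before any model call. The sketch below is an assumption about how you might do it, not part of the Part 6 suite: REQUIRED_KEYS, missing_keys, and check_environment are hypothetical names. It treats an empty value as missing, because GitHub Actions substitutes an empty string when a referenced secret is not set.

```python
import os
from collections.abc import Mapping

# Hypothetical list: extend it with LOGFIRE_TOKEN if you trace.
REQUIRED_KEYS = ("OPENAI_API_KEY",)


def missing_keys(
    environ: Mapping[str, str],
    required: tuple[str, ...] = REQUIRED_KEYS,
) -> list[str]:
    """Return the required variable names that are absent or blank."""
    return [name for name in required if not environ.get(name, "").strip()]


def check_environment() -> None:
    """Raise a readable error before any model call is attempted."""
    missing = missing_keys(os.environ)
    if missing:
        raise RuntimeError(
            "Missing or empty environment variables: "
            + ", ".join(missing)
            + ". Add them as repository secrets and pass them via env in the workflow."
        )


# An unset secret arrives as an empty string, so it counts as missing.
empty_result = missing_keys({"OPENAI_API_KEY": ""})
set_result = missing_keys({"OPENAI_API_KEY": "sk-example"})
print(empty_result, set_result)
```

Calling check_environment() at the top of test_agent_quality would turn a confusing auth error into a one-line message naming the variable to fix.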

Step 3 — Add Secrets and Make the Check Required

Two small settings turn a workflow that runs into a gate that blocks. First, give the runner its keys. In your repo, go to Settings → Secrets and variables → Actions → New repository secret, and add OPENAI_API_KEY (and LOGFIRE_TOKEN if you trace). These are encrypted and exposed only to your workflows.

Second — and this is the step people skip — make the check required. A green or red check is just advisory until you enforce it. Add a branch protection rule on your default branch:

Open Settings → Branches → Add branch ruleset (or a classic branch protection rule).
Target your default branch (usually main).
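Enable Require status checks to pass before merging, then select the evals job. The check is listed only after the workflow has run at least once in the repository, so open a PR first if you cannot find it.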

🔑 Key point: Without branch protection, the eval check is a suggestion. With it, GitHub disables the merge button until the eval job is green. That rule, not the YAML, is what actually stops a regression.

One real-world caveat: GitHub does not expose secrets to pull requests opened from forks. If outside contributors send PRs, their eval job can’t reach your keys, so plan to run the gate on a trusted branch or via a maintainer-approved workflow for fork contributions.

Testing It + Common Errors

Don’t trust a gate you haven’t seen fail. The best test is to break the agent on purpose: open a PR that loosens the system prompt to allow guessing, and watch the no_such_tool-style case flip to a fail. The evals check should go red and the merge button should lock. Revert, and it goes green. A gate that never fails isn’t protecting anything.

Three errors cost me the most time the first time I set this up:

Passes locally, fails in CI. Almost always a missing secret. Confirm the exact name matches the env key, character for character.
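Flaky failures. An LLM judge isn’t fully deterministic. Set temperature=0 in your eval runs to reduce the variance (it rarely removes it entirely) and, for larger suites, drop the threshold slightly so one borderline case doesn’t block everyone.
Surprise cost. Every run is real model calls. Keep the case set small and focused; if it grows, scope the trigger to pull_request paths that touch agent code so docs-only PRs skip the spend. One caveat: when a path filter skips the whole workflow, a required check stays pending and the PR cannot merge. One workaround is to keep the workflow running on every PR and skip only the expensive step inside the job.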

What to Build Next

You now have the safety net the whole series was building toward. The natural next layer is guardrails — runtime checks that validate an agent’s output during a request, not just in CI, so a bad answer is caught before it reaches a user. I’d add input validation and an output schema check first, because they catch the failures evals can’t predict.

After that, two cheap wins: post the eval pass rate as a PR comment so reviewers see the score inline, and add a small cost budget assertion to the same test so a change that doubles token usage also fails the build. Both reuse the report object you already have.
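As a rough sketch of those two wins, the snippet below formats a PR comment and checks a token budget. Everything in it is illustrative: EvalSummary, TOKEN_BUDGET, and the numbers are hypothetical, and how you extract the pass rate and token totals from your report (and how you post the comment) depends on your setup and is not shown.

```python
from dataclasses import dataclass

TOKEN_BUDGET = 10_000  # hypothetical ceiling for one full eval run


@dataclass
class EvalSummary:
    pass_rate: float   # 0.0 to 1.0
    total_tokens: int  # tokens spent across the whole run


def within_budget(summary: EvalSummary, budget: int = TOKEN_BUDGET) -> bool:
    """True when the run stayed at or under the token budget."""
    return summary.total_tokens <= budget


def pr_comment(summary: EvalSummary, threshold: float, budget: int = TOKEN_BUDGET) -> str:
    """Build a short plain-text summary suitable for a PR comment."""
    ok = summary.pass_rate >= threshold and within_budget(summary, budget)
    verdict = "pass" if ok else "fail"
    return (
        f"Eval gate: {verdict}\n"
        f"Pass rate: {summary.pass_rate:.0%} (threshold {threshold:.0%})\n"
        f"Tokens: {summary.total_tokens} (budget {budget})"
    )


# Made-up run: quality is fine, but token usage blew past the budget.
summary = EvalSummary(pass_rate=0.95, total_tokens=12_000)
comment = pr_comment(summary, threshold=0.9)
print(comment)
```

In the test itself, a second assert on within_budget(summary) would fail the build for the cost regression even when the pass rate holds.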

Conclusion

That’s the series. Your agent can act, remember, run as a service, and be measured, and now it’s guarded, because the one check that matters runs without you remembering to run it. The gate is one test and a short workflow file, but it changes the default: regressions are caught before review, not after release. Capability you can’t protect is capability that quietly rots, and now yours can’t.

I’ll leave you with the question that decides how strict your gate should be: what’s the one regression you’d never want to ship — a hallucinated number, a wrong tool call, a refusal that should’ve been an answer? Tell me in the comments, and encode it as your first required case.

🧭 Where to go from here

Need the eval suite first? Part 6 builds the tests this gate runs.
Start the series: Part 1 builds the agent everything else extends.
Want type-safe outputs? Pair this with the Pydantic AI tutorial.
