Your agent demos beautifully — three test questions, clean output, you ship it. Two weeks later it’s quietly calling the wrong API, passing a malformed customer ID, and burning forty steps on tasks that should take four, and no number ever caught it. That’s the case for AI agent evaluation: you can’t ship what you can’t measure, and a good-looking final answer is not a measurement.
This guide shows how to evaluate an AI agent for real in 2026: the metrics that actually matter, the three ways to score them, a tiny eval harness you can run today, and the framework landscape so you know which tool to reach for. The throughline is one shift the whole field made this year: you grade the path, not just the answer.
Prerequisites
- You’ve built or run at least one tool-using agent. New to that? Start with What are AI agents?
- Comfortable reading Python (functions, dicts, lambdas)
- Helpful but optional: you’ve seen an agent loop or misbehave (for background, see why agents keep looping)
Key takeaways
- Grade the trajectory, not just the output. An agent can return a perfect answer while calling the wrong tool or wasting ten steps — only path-level metrics catch that.
- Six metrics cover most needs: task success, tool-call accuracy, trajectory quality, faithfulness, latency, and cost — plus policy adherence for enterprise work.
- Score with the cheapest reliable method: deterministic checks first, LLM-as-a-judge for open-ended quality (with a rubric), human review to calibrate the judge.
Why Evaluating an Agent Isn’t Like Testing a Model
A single LLM call is easy to grade: one input, one output, compare it to what you wanted. An agent breaks that model because it makes a sequence of decisions — plan, call a tool, read the result, decide again — and any step in that chain can be wrong while the final answer still looks right. That’s the trap. The output passes a human glance, so you never notice the agent picked refund_full instead of refund_partial, or retried a failing call eight times before stumbling into success.
So the big move in 2026 is from single-turn scoring to trajectory-level evaluation: grading the whole path the agent took — which tools it called, whether it recovered from a failed call, and how many wasted steps it racked up. Benchmarks pushed this further. Sierra’s τ-bench (and its successor τ²-bench) fails an agent that books the right flight but violates the stated change-fee policy. That’s a far higher bar than “did it finish,” and it’s the bar real deployments are held to.
Bottom line: if your eval only looks at the last message, it’s blind to most of the ways an agent actually fails.
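To make “grading the path” concrete, here is a minimal sketch of a trajectory summary. The step shape (`name`, `args`, `ok`) and the helper `trajectory_stats` are assumptions for illustration, not a standard trace format; adapt them to whatever your agent actually logs.

```python
from collections import Counter

def trajectory_stats(steps):
    """Summarize a recorded run.

    Each step is a dict with hypothetical keys:
    name (tool name), args (dict), ok (did the call succeed?).
    """
    # Identify a call by tool name plus sorted arguments so repeats are detectable.
    sigs = [(s["name"], tuple(sorted(s["args"].items()))) for s in steps]
    repeated = sum(n - 1 for n in Counter(sigs).values() if n > 1)
    failed = [i for i, s in enumerate(steps) if not s["ok"]]
    # A failure counts as recovered if a later call to the same tool succeeded.
    recovered = sum(
        1 for i in failed
        if any(t["name"] == steps[i]["name"] and t["ok"] for t in steps[i + 1:])
    )
    return {
        "steps": len(steps),
        "repeated_calls": repeated,
        "failed_calls": len(failed),
        "recovered_failures": recovered,
    }

# The final answer of this run would look fine; the path shows two wasted retries.
run = [
    {"name": "lookup_order", "args": {"id": "1234"}, "ok": False},
    {"name": "lookup_order", "args": {"id": "1234"}, "ok": False},
    {"name": "lookup_order", "args": {"id": "1234"}, "ok": True},
    {"name": "refund_partial", "args": {"id": "1234"}, "ok": True},
]
stats = trajectory_stats(run)
print(stats)
# {'steps': 4, 'repeated_calls': 2, 'failed_calls': 2, 'recovered_failures': 2}
```

An output-only check sees none of these numbers, which is exactly the blind spot described above.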
The Metrics That Actually Matter
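You don’t need thirty metrics. You need a handful that map to the ways agents break. Here’s the working set. The items fall into three tiers: outcome (task success, faithfulness, policy adherence), trajectory (tool-call accuracy, trajectory quality), and operational (latency and cost).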
- Task success (the outcome). Did the agent reach the correct end state? This is the headline number, and it’s verified on the result, not the wording. An agent can call every tool correctly and still fail the task — so you always need an end-to-end check.
- Tool-call accuracy. Did it choose the right tool and pass the right arguments? Plain LLM metrics tell you nothing about whether it hit the correct endpoint or passed the correct order ID. Split it into tool selection and tool argument correctness.
- Trajectory quality. The path itself: step count, whether it recovered from a failed call, and wasted or looping steps. A run that succeeds in 14 steps when 4 would do is a quiet cost and reliability problem — the same failure behind agents that loop.
- Faithfulness (groundedness). Is the answer actually supported by the tool outputs it saw, or did the model fill gaps with invention? This is the agent cousin of RAG faithfulness.
- Latency and cost. Wall-clock time, token spend, and dollars per task. These are first-class quality metrics for anything user-facing or run at scale.
- Policy adherence. Did it follow the rules — refund limits, allowed actions, tone — even when breaking them would “complete” the task? This is the enterprise bar τ-bench made famous.
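Policy rules that reduce to limits or allow-lists can be checked in code. The sketch below assumes a hypothetical `policy` dict with `allowed_tools` and `max_refund`; real policies (tone, for instance) usually need a judge as well.

```python
def policy_violations(tool_calls, policy):
    """Return human-readable violations for one recorded run.

    `policy` is a hypothetical dict: allowed_tools (set), max_refund (float).
    """
    violations = []
    for i, call in enumerate(tool_calls):
        if call["name"] not in policy["allowed_tools"]:
            violations.append(f"step {i}: tool '{call['name']}' is not allowed")
        amount = call["args"].get("amount")
        if (call["name"] == "issue_refund" and amount is not None
                and amount > policy["max_refund"]):
            violations.append(
                f"step {i}: refund {amount} exceeds limit {policy['max_refund']}"
            )
    return violations

policy = {
    "allowed_tools": {"lookup_order", "issue_refund", "send_email"},
    "max_refund": 50.0,
}
calls = [
    {"name": "lookup_order", "args": {"id": "1234"}},
    {"name": "issue_refund", "args": {"id": "1234", "amount": 80.0}},
    {"name": "delete_account", "args": {"id": "1234"}},
]
violations = policy_violations(calls, policy)
print(violations)
# ['step 1: refund 80.0 exceeds limit 50.0', "step 2: tool 'delete_account' is not allowed"]
```

A run with any violation can then be failed outright, even if its end state looks correct.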
Bottom line: pick one metric from each tier — outcome, trajectory, operational — before you write a line of eval code.
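The split of tool-call accuracy into tool selection and argument correctness can be sketched as follows. This version compares calls position by position, which assumes the expected order is known and ignores extra calls beyond the expected list; `tool_call_scores` is a name made up for this example.

```python
def tool_call_scores(made, expected):
    """Score tool selection and argument correctness separately.

    Calls are compared position by position. `expected` must be non-empty.
    """
    pairs = list(zip(made, expected))
    right_tool = [(m, e) for m, e in pairs if m["name"] == e["name"]]
    right_args = [(m, e) for m, e in right_tool if m["args"] == e["args"]]
    return {
        # Share of expected calls where the agent picked the right tool.
        "tool_selection": round(len(right_tool) / len(expected), 2),
        # Among correctly chosen tools, share with exactly the right arguments.
        "argument_correctness": (
            round(len(right_args) / len(right_tool), 2) if right_tool else 0.0
        ),
    }

expected = [
    {"name": "lookup_order", "args": {"id": "1234"}},
    {"name": "issue_refund", "args": {"id": "1234", "amount": 49.0}},
    {"name": "send_email", "args": {"to": "buyer@example.com"}},
]
wrong_amount = [
    {"name": "lookup_order", "args": {"id": "1234"}},
    {"name": "issue_refund", "args": {"id": "1234", "amount": 12.0}},
    {"name": "send_email", "args": {"to": "buyer@example.com"}},
]
wrong_tool = [
    {"name": "lookup_order", "args": {"id": "1234"}},
    {"name": "refund_full", "args": {"id": "1234"}},
    {"name": "send_email", "args": {"to": "buyer@example.com"}},
]
print(tool_call_scores(wrong_amount, expected))
# {'tool_selection': 1.0, 'argument_correctness': 0.67}
print(tool_call_scores(wrong_tool, expected))
# {'tool_selection': 0.67, 'argument_correctness': 1.0}
```

Keeping the two numbers apart tells you whether to fix the tool descriptions (selection) or the argument schema and prompts (arguments).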
How to Score Them: Deterministic, Judge, Human
Knowing what to measure is half of it; the other half is how to produce a score. There are three methods, and the skill is using the cheapest reliable one for each metric.
- Deterministic checks — use these first. End-state assertions, exact tool-argument matches, step counts. When the answer is checkable in code, an LLM judge only adds cost and noise. Most of task success, tool-call accuracy, and trajectory quality can be deterministic.
- LLM-as-a-judge — for open-ended quality. Faithfulness, reasoning, helpfulness — things with no single right string. It’s fast and cheap, and it’s the default for subjective metrics, but it’s unreliable if you wing it.
- Human review — the calibration anchor. Label a sample by hand and use it to check your judge agrees. This is also your ground truth for the nuanced cases neither code nor a model gets right.
Bottom line: deterministic where you can, judge where you must, human to keep the judge honest.
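One way to check that the judge agrees with human labels is to compute raw agreement plus Cohen’s kappa, which corrects for agreement that would happen by chance. This is a sketch for boolean labels only; the sample labels are invented, and what counts as “good enough” agreement is a call you make for your own use case.

```python
def agreement(human, judge):
    """Raw agreement and Cohen's kappa between two lists of boolean labels."""
    if not human or len(human) != len(judge):
        raise ValueError("need paired, non-empty label lists")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Chance agreement if both raters labeled independently at their own base rates.
    p_h, p_j = sum(human) / n, sum(judge) / n
    expected = p_h * p_j + (1 - p_h) * (1 - p_j)
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0
    return {"agreement": round(observed, 2), "kappa": round(kappa, 2)}

# Invented labels: True means "faithful" for ten hand-reviewed answers.
human = [True, True, False, False, True, False, True, True, False, True]
judge = [True, True, False, True, True, False, True, False, False, True]
calibration = agreement(human, judge)
print(calibration)
# {'agreement': 0.8, 'kappa': 0.58}
```

Here the judge matches the human on 8 of 10 cases, but kappa is noticeably lower because some of that agreement is expected by chance. The disagreements are the cases worth reading before you trust the judge at scale.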
A Minimal Eval Harness You Can Run
Frameworks do this at scale, but the core is a scoring function over a recorded run, and writing it once makes every framework make sense. This harness checks the end state and the tool calls, including their arguments, with no API key and no dependencies. To show why that matters, it scores two runs: a clean one, and a buggy one that refunds the wrong amount but still “finishes.”
```python
def sig(call):
    return (call["name"], tuple(sorted(call["args"].items())))

# A correct run must reach this end state AND make exactly these calls.
spec = {
    "success": lambda s: s.get("refunded") and s.get("emailed"),
    "expected_calls": [
        {"name": "lookup_order", "args": {"id": "1234"}},
        {"name": "issue_refund", "args": {"id": "1234", "amount": 49.0}},
        {"name": "send_email", "args": {"to": "buyer@example.com"}},
    ],
    "max_steps": 4,
}

def evaluate(run, spec):
    made = [sig(c) for c in run["tool_calls"]]
    want = {sig(c) for c in spec["expected_calls"]}
    correct = want & set(made)  # right tool AND right args
    wrong = [c for c in run["tool_calls"] if sig(c) not in want]
    return {
        "task_success": bool(spec["success"](run["final_state"])),
        "tool_accuracy": round(len(correct) / len(want), 2),
        "wrong_calls": len(wrong),
        "steps": len(made),
        "wasted_steps": max(0, len(made) - spec["max_steps"]),
    }

good = {
    "tool_calls": [
        {"name": "lookup_order", "args": {"id": "1234"}},
        {"name": "issue_refund", "args": {"id": "1234", "amount": 49.0}},
        {"name": "send_email", "args": {"to": "buyer@example.com"}},
    ],
    "final_state": {"refunded": True, "emailed": True},
}

buggy = {
    "tool_calls": [  # refunds $12 instead of $49, but still "finishes"
        {"name": "lookup_order", "args": {"id": "1234"}},
        {"name": "issue_refund", "args": {"id": "1234", "amount": 12.0}},
        {"name": "send_email", "args": {"to": "buyer@example.com"}},
    ],
    "final_state": {"refunded": True, "emailed": True},
}

print(evaluate(good, spec))
print(evaluate(buggy, spec))
# good  -> {'task_success': True, 'tool_accuracy': 1.0, 'wrong_calls': 0, 'steps': 3, 'wasted_steps': 0}
# buggy -> {'task_success': True, 'tool_accuracy': 0.67, 'wrong_calls': 1, 'steps': 3, 'wasted_steps': 0}
```
Both runs report task_success: True — the outcome check is fooled. Only the argument-level tool_accuracy (0.67) and wrong_calls (1) notice the buggy run refunded the wrong amount. That’s the entire case for trajectory metrics, in two lines of output. Run it over a dataset instead of two runs, average the scores, and you have an offline eval.
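Averaging over a dataset can be as small as the sketch below. It takes score dictionaries shaped like the ones the harness prints and reduces them to dataset-level numbers; `summarize` is a name made up for this example, and it assumes every run reports the same keys.

```python
from statistics import mean

def summarize(results):
    """Average per-run score dicts into dataset-level numbers.

    Booleans average to a pass rate because True counts as 1 and False as 0.
    Assumes every dict in `results` has the same keys.
    """
    keys = results[0].keys()
    return {k: round(mean(float(r[k]) for r in results), 3) for k in keys}

# The two score dicts from the harness output above.
results = [
    {"task_success": True, "tool_accuracy": 1.0, "wrong_calls": 0, "steps": 3, "wasted_steps": 0},
    {"task_success": True, "tool_accuracy": 0.67, "wrong_calls": 1, "steps": 3, "wasted_steps": 0},
]
summary = summarize(results)
print(summary)
# {'task_success': 1.0, 'tool_accuracy': 0.835, 'wrong_calls': 0.5, 'steps': 3.0, 'wasted_steps': 0.0}
```

Note how a 100% task success rate sits next to an average of half a wrong call per run: the aggregate keeps the same lesson as the single-run comparison.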
Deterministic checks can’t grade everything, though. Faithfulness — is the answer actually supported by the tool output? — has no single right string, so you hand it to a judge behind a strict rubric (illustrative; judge_model is any chat model, ideally stronger than the agent’s):
```python
import json

JUDGE_RUBRIC = """Grade whether the ANSWER is fully supported by the TOOL OUTPUT.
Reply as JSON: {"faithful": true|false, "reason": "<one sentence>"}.
Mark false if the answer asserts anything the tool output does not contain."""

def judge_faithfulness(answer, tool_output, judge_model):
    reply = judge_model.invoke([
        {"role": "system", "content": JUDGE_RUBRIC},
        {"role": "user", "content": f"TOOL OUTPUT:\n{tool_output}\n\nANSWER:\n{answer}"},
    ])
    return json.loads(reply.content)  # -> {"faithful": False, "reason": "..."}
```
A real framework wraps both halves — the deterministic scorer and the judge — with tracing, a results UI, and CI hooks, but the scoring loop is exactly this.
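One practical caveat with the judge: models do not always return clean JSON, and a bare `json.loads` will raise on the first chatty reply. A tolerant parser is one way to handle that. The sketch below is an assumption about how you might do it, not part of any framework; `parse_judge_reply` is a hypothetical helper that returns `faithful=None` for unusable replies so the caller can retry or send the case to human review.

```python
import json

def parse_judge_reply(text):
    """Parse a judge reply into {"faithful": bool or None, "reason": str}.

    faithful is None when the reply cannot be used.
    """
    # Keep only the outermost {...} in case the model wrapped the JSON in prose.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return {"faithful": None, "reason": "no JSON object in reply"}
    try:
        data = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return {"faithful": None, "reason": "invalid JSON in reply"}
    # Insist on a real boolean; strings like "yes" are treated as unusable.
    if not isinstance(data, dict) or not isinstance(data.get("faithful"), bool):
        return {"faithful": None, "reason": "missing boolean 'faithful' field"}
    return {"faithful": data["faithful"], "reason": str(data.get("reason", ""))}

print(parse_judge_reply('Sure! {"faithful": false, "reason": "Price not in tool output."}'))
# {'faithful': False, 'reason': 'Price not in tool output.'}
print(parse_judge_reply("Looks fine to me."))
# {'faithful': None, 'reason': 'no JSON object in reply'}
```

Counting how often `faithful` comes back as `None` is itself a useful health metric for the judge prompt.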
Bottom line: an eval is a scoring function over a golden dataset — the framework is convenience on top of those two things.
The AI Agent Evaluation Framework Landscape (2026)
When the hand-rolled harness stops scaling, here’s where to go. The tools split into code-first libraries, tracing-plus-eval platforms, and benchmarks.
- DeepEval — pytest-style, open-source (Apache 2.0), with ready-made agent metrics (task completion, tool correctness). The fastest on-ramp if you think in unit tests.
- Ragas — reference-free metrics, strongest on faithfulness and context; born for RAG, extended to agents.
- LangSmith — LangChain-native trajectory evals, datasets, and tracing in one place; the default if you already build on LangChain.
- Arize Phoenix and Langfuse — open-source, OpenTelemetry-native tracing with evaluation layered on; great for seeing the trajectory you’re grading.
- Braintrust, Maxim, Galileo — hosted, eval-first platforms that bundle experimentation, production monitoring, and runtime guardrails.
For a reality check beyond your own dataset, lean on public benchmarks: τ²-bench (tool-agent-user with policy adherence), GAIA (general assistant tasks), and SWE-bench (real software issues). They won’t match your domain, but they’re how you sanity-check a model choice before you commit.
Bottom line: start with one open-source library plus a tracing layer; add a platform when you need hosted monitoring and guardrails.
Putting It Together: a Workflow
The pieces assemble into a loop you run forever, not a one-time gate.
- Define success per task as a checkable end state — before anything else.
- Build a golden dataset of 10–30 representative tasks; grow it from real failures.
- Score with deterministic checks plus a calibrated judge for the fuzzy metrics.
- Gate offline in CI on every change, and monitor online on production traces.
- Feed new failures back into the dataset so the eval gets stronger over time.
Bottom line: evaluation is a flywheel — every production failure becomes a permanent test.
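The “gate offline in CI” step can be sketched as a threshold check over dataset-level scores. The threshold values and the `summary` dict here are placeholders; pick minimums from your own baseline runs.

```python
# Hypothetical minimums; set these from your own baseline.
THRESHOLDS = {"task_success": 0.90, "tool_accuracy": 0.95}

def gate(summary, thresholds):
    """Return {metric: (actual, minimum)} for every metric below its minimum.

    A metric missing from the summary is treated as 0.0, so it fails the gate.
    """
    return {
        name: (summary.get(name, 0.0), minimum)
        for name, minimum in thresholds.items()
        if summary.get(name, 0.0) < minimum
    }

def exit_code(summary, thresholds):
    """Print failures and return a process exit code (0 = pass, 1 = fail)."""
    failures = gate(summary, thresholds)
    for name, (got, minimum) in failures.items():
        print(f"FAIL {name}: {got:.2f} < {minimum:.2f}")
    return 1 if failures else 0

# Stand-in for the averaged scores of a real eval run.
summary = {"task_success": 0.93, "tool_accuracy": 0.88}
code = exit_code(summary, THRESHOLDS)
print("exit code:", code)
# FAIL tool_accuracy: 0.88 < 0.95
# exit code: 1
```

In a CI script you would pass the returned value to `sys.exit` so a regression fails the build.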
Quick Recap
| Question | Metric tier | How to score it |
|---|---|---|
| Did it reach the right end state? | Outcome — task success | Deterministic assertion |
| Right tool, right arguments? | Trajectory — tool-call accuracy | Deterministic match |
| How many steps / did it recover? | Trajectory — quality | Step count, recovery check |
| Is the answer grounded? | Outcome — faithfulness | LLM-as-a-judge (rubric) |
| Fast and affordable enough? | Operational — latency, cost | Measured from traces |
| Did it follow the rules? | Outcome — policy adherence | Deterministic + judge |
Conclusion
Evaluating an agent isn’t about one magic score — it’s about measuring the path as carefully as the answer, with the cheapest reliable method for each metric. Pick one outcome metric, one trajectory metric, and one operational metric; build a small golden dataset; score it deterministically where you can and with a calibrated judge where you can’t. That gives you a number you can trust before you ship and a monitor that watches what your dataset missed.
Do the unglamorous part first: write down what “success” means for ten real tasks. That single artifact turns “it seems to work” into a measurement — and a measurement is the only thing you can actually improve.
What’s the first metric you’d add to your agent’s eval — task success, tool accuracy, or cost? Tell me in the comments, especially if a trajectory metric once caught a bug an output check missed.
Read next: Why Your LangGraph Agent Keeps Looping — wasted-step and loop metrics in action, with the fixes behind them.
- New to agents? Ground the basics with What are AI agents? then build your first one.
- Picking a stack? Compare options in Best AI Agent Frameworks (2026).
- Going to production? Add a human-in-the-loop checkpoint for the actions your eval flags as risky.
