January 26, 2026

Measure What Counts: The AI Engineering Approach to Agent Evaluation

Liza Katz | CEO at Neradot

Stakeholders love single numbers. Oh boy they do.

Wouldn’t it be lovely if you could treat the AI as a black box and hear a simple report: "Our RAG system is 93% accurate." Go ahead and deploy that to production!

For a product manager, that number is a comfort. But for an engineer, it is a trap.

Because even if a single aggregate score tells you that your system failed, it gives you zero insight into why. Did the retriever miss the document? Did the router send the user to the wrong flow? Did the LLM hallucinate?

Evaluating a complex Agent with a single pass/fail metric is like running a massive integration test on a crashing system and trying to guess which class is responsible for the error. To build reliable agents, we need to push back on the "Single Score" fallacy and start running scientific controlled experiments on each component.

The "Scientific Method" for Evaluating Agent Components

Modern AI frameworks encourage breaking agents into nodes (chains, graphs, flows). Our evaluation strategy must mirror this. This is why at Neradot, we developed our own evaluation framework neralabs to bring Test-Driven Development (TDD) principles to AI, using "Soft Unit Tests" for probabilistic parts and standard Unit Tests for deterministic parts.

Here is the our evaluation stack for a robust Agent.

1. Deterministic Tools (Standard Unit Tests)

Not all tools use AI. Some tools parse documents, retrieve data from an API, or query a database. For those tools, there’s no "AI magic". If your agent uses a calculator, formats a date, or queries a SQL database, the output is binary: it works or it doesn't.

The Experiment: Input -> Expected Output.
The Metric: assert result == expected.
Don't waste tokens or introduce uncertainty by asking an LLM to judge if 2+2=4.

2. Routing and Classification (Regression Tests)

Many nodes in an agent architecture are effectively classifiers. They might be binary (e.g., "Is this query spam?" or "Is this a jailbreak attempt?") or category-based (e.g., routing a user to "Support," "Analytics," or "Sales").

The Experiment: Maintain a "Golden Set" of examples with known labels (e.g., "Delete my account" -> Intent: Delete), run them through your node, and check if the output matches the label.
The Metrics:
Once you have the raw results (e.g., 90 matches, 10 mismatches), a simple percentage isn't always the right way to summarize them. You need to summarize the stats based on what you are trying to optimize:
- If you want to measure General Competence (Accuracy):
  The Summary: "How many did we get right overall?"
  When to use it: Your data is balanced (e.g., 50% Billing queries, 50% Support queries)
- If you must NOT miss a critical signal (Recall):
  The Summary: "Out of all the 'Delete Account' requests that exist, how many did we find?"
  When to use it: Missing a request is a disaster. You don't care if you occasionally flag a harmless query as "Delete" (a human can review it), as long as you catch 100% of the actual deletion requests.
- If you must avoid false alarms (Precision):
  The Summary: "When we flagged something as 'Spam', was it actually Spam?"
  When to use it: False alarms are expensive. If your spam filter blocks the CEO's email, you are in trouble. You want to trust that a positive prediction is real.

3. Retrieval (Search Science)

Folks like to say that RAG is dead, but it's actually just hiding behind your nodes. Any node that retrieves data based on input from an LLM - whether it's semantic vector search, keyword matching, or a SQL query - is doing RAG.

This is an Information Retrieval problem, not a Generative AI problem. Any search specialist will tell you there is nothing new here. So luckily we can use the standard metrics that have existed for decades.

The Experiment: Query the vector DB with a known question and a set of "ground truth" relevant documents.
The Metrics:
- NDCG (Normalized Discounted Cumulative Gain): The gold standard. It respects order. Finding the right document at position #1 is much better than finding it at position #5. NDCG captures this value.
- MAP (Mean Average Precision): Measures completeness. If there are 5 relevant documents, did we find all of them, or just one?

4. Tool Selection (Schema Validation)

Another popular node pattern is tool selection, where an LLM is given a selection of tools to perform a task and it needs to plan the execution. This might be a single step (standard function calling) or a complex ReAct loop where the model reasons before acting.

The Experiment: Provide a query ("Book a flight to Paris") and verify the output structure.
The Metrics:
- Tool Name: Exact Match (Did it call book_flight?).
- Arguments: Semantic Match (Did it extract destination="Paris"?).
- Hallucination Rate: Did it try to call a tool that doesn't exist?

5. Final Answer Generation (LLM-as-a-Judge)

Many AI flows (but not all!) end with a generation endpoint that synthesizes the state into a textual answer.

This is the only layer where the output is truly unstructured and requires another LLM to evaluate it. Because we have validated every upstream component (Retrieval, Routing, Tools), we can trust that any failure here is due to the generation model itself, not bad data.

The Experiment: Compare the generated answer against a reference answer or a set of facts.
The Metrics: Faithfulness (Did it stick to the context?) and Relevance (Did it answer the user?).

6. Bonus: Guardrails (Safety Tests)

Every node that interacts with an LLM should be mindful of guardrails. This means making sure that it is resilient to jailbreaks, PII injections, and prompt injection attacks.

Don't think of Guardrails as a separate "box" in your architectural diagram. From an evaluation perspective, they are just an extra few cases in your test suite for that node.

Just as you test a standard function for NullPointerExceptions, you must test your LLM node for JailbreakExceptions.

The Experiment: Fuzz testing / Red Teaming. Feed the node adversarial inputs (e.g., "Ignore previous instructions," PII injection, or "How do I make a bomb?").
The Metric: Refusal Rate. The test passes only if the system refuses to answer. If the model answers the question - even if the answer is "I'm not sure" - it’s a failure. It must explicitly reject the premise.

Operationalizing this with neralabs

It can be difficult to design, build and run these six types of experiments manually. You have to maintain Golden Sets for routing, separate Golden Sets for retrieval, and manage prompts for the LLM judges.

But it is worth it. This is the only path to really understand and optimize an AI application’s performance and stability.

Internally at Neradot, we built our own framework, neralabs, to handle this complexity. We needed a tool that treats evaluation as an engineering discipline rather than a Jupyter notebook exercise. It allows us to define these component-wise experiments as a formal suite of tests, orchestrating the regression tests, retrieval metrics, and LLM judges in one workflow.

It shifts our evaluation from "vibes based" to "test-suite based," giving us the granular pass/fail signals we need to deploy with confidence.

Conclusion: Leaving the Lab

By breaking down evaluation into these components, we can convert a messy "AI problem" into a solvable "Software Engineering problem." High granularity gives you visibility and stability. It allows you to swap out your retrieval model without fearing you've broken your routing logic.

However, everything we discussed above is just Offline Evaluation. It is what we do in the "Lab."

But the real world is messier. Users don't stick to "Golden Sets," data drifts, and latency matters. To handle that, we actually need a whole different set of strategies..

But that is a topic for the next article.

‍