Sign up for our newsletter today and never miss a Neradot update
Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.

Stakeholders love single numbers. Oh boy they do.
Wouldn’t it be lovely if you could treat the AI as a black box and hear a simple report: "Our RAG system is 93% accurate." Go ahead and deploy that to production!
For a product manager, that number is a comfort. But for an engineer, it is a trap.
Because even if a single aggregate score tells you that your system failed, it gives you zero insight into why. Did the retriever miss the document? Did the router send the user to the wrong flow? Did the LLM hallucinate?
Evaluating a complex Agent with a single pass/fail metric is like running a massive integration test on a crashing system and trying to guess which class is responsible for the error. To build reliable agents, we need to push back on the "Single Score" fallacy and start running scientific controlled experiments on each component.
Modern AI frameworks encourage breaking agents into nodes (chains, graphs, flows). Our evaluation strategy must mirror this. This is why at Neradot, we developed our own evaluation framework neralabs to bring Test-Driven Development (TDD) principles to AI, using "Soft Unit Tests" for probabilistic parts and standard Unit Tests for deterministic parts.
Here is the our evaluation stack for a robust Agent.
Not all tools use AI. Some tools parse documents, retrieve data from an API, or query a database. For those tools, there’s no "AI magic". If your agent uses a calculator, formats a date, or queries a SQL database, the output is binary: it works or it doesn't.
Many nodes in an agent architecture are effectively classifiers. They might be binary (e.g., "Is this query spam?" or "Is this a jailbreak attempt?") or category-based (e.g., routing a user to "Support," "Analytics," or "Sales").
Folks like to say that RAG is dead, but it's actually just hiding behind your nodes. Any node that retrieves data based on input from an LLM - whether it's semantic vector search, keyword matching, or a SQL query - is doing RAG.
This is an Information Retrieval problem, not a Generative AI problem. Any search specialist will tell you there is nothing new here. So luckily we can use the standard metrics that have existed for decades.
Another popular node pattern is tool selection, where an LLM is given a selection of tools to perform a task and it needs to plan the execution. This might be a single step (standard function calling) or a complex ReAct loop where the model reasons before acting.
Many AI flows (but not all!) end with a generation endpoint that synthesizes the state into a textual answer.
This is the only layer where the output is truly unstructured and requires another LLM to evaluate it. Because we have validated every upstream component (Retrieval, Routing, Tools), we can trust that any failure here is due to the generation model itself, not bad data.
Every node that interacts with an LLM should be mindful of guardrails. This means making sure that it is resilient to jailbreaks, PII injections, and prompt injection attacks.
Don't think of Guardrails as a separate "box" in your architectural diagram. From an evaluation perspective, they are just an extra few cases in your test suite for that node.
Just as you test a standard function for NullPointerExceptions, you must test your LLM node for JailbreakExceptions.
It can be difficult to design, build and run these six types of experiments manually. You have to maintain Golden Sets for routing, separate Golden Sets for retrieval, and manage prompts for the LLM judges.
But it is worth it. This is the only path to really understand and optimize an AI application’s performance and stability.
Internally at Neradot, we built our own framework, neralabs, to handle this complexity. We needed a tool that treats evaluation as an engineering discipline rather than a Jupyter notebook exercise. It allows us to define these component-wise experiments as a formal suite of tests, orchestrating the regression tests, retrieval metrics, and LLM judges in one workflow.
It shifts our evaluation from "vibes based" to "test-suite based," giving us the granular pass/fail signals we need to deploy with confidence.
By breaking down evaluation into these components, we can convert a messy "AI problem" into a solvable "Software Engineering problem." High granularity gives you visibility and stability. It allows you to swap out your retrieval model without fearing you've broken your routing logic.
However, everything we discussed above is just Offline Evaluation. It is what we do in the "Lab."
But the real world is messier. Users don't stick to "Golden Sets," data drifts, and latency matters. To handle that, we actually need a whole different set of strategies..
But that is a topic for the next article.