How the hell do you test LLMs?
A guide to deterministically testing non-deterministic systems
When you decide to ship a product with agents, you immediately gain a load of problems that standard engineering practices don't have a straightforward answer to. Your logic is great, the code compiles, all the unit tests pass, aaand then in production the whole thing falls over because the model decided that today it will hallucinate a new field name.
Welcome to non-deterministic systems.
This is an ongoing field of experimentation and improvement. I'm sure that in two years' time, new testing frameworks and fantastically smart engineers will have revolutionised this space. In the meantime, here's what we did at Grafos to get our agents working well consistently.
Traditional tests
Software tests generally fall into four categories.
Unit tests: Quick and cheap, test specific functions' behaviours
Integration tests: Expanded scope, how these modules and functions interact with each other
E2E tests: How the entire system works
QA/User acceptance testing: Once shipped, manually review in staging
As you move down the list, these tests get more expensive, slower to run, and harder to maintain. Generally, you want to structure your code so it’s easier to test in the earlier stages. The later stages exist to catch any bugs that the earlier stages can’t, not to replace them.
We can apply this model of thinking to how we test our agents and LLMs.
0th law of agents: shrink the non-deterministic behaviour
Before writing anything, reduce what the LLM can output.
When productionising our agents, we swapped from outputting raw JSON, which we had to parse, to structured output with Pydantic. This forces the agent to respond with the fields we expect, and a whole host of our tests can be about field validation rather than parsing prose. This is a godsend for forcing your agent to cut out the preamble in intermediate steps.
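As a minimal sketch of the idea (the `CypherResult` schema and its fields are illustrative, not our real model): once the output is a Pydantic model, a malformed response fails loudly and deterministically instead of being half-parsed prose.

```python
from pydantic import BaseModel, ValidationError

# Hypothetical schema for a Cypher-generation step -- field names are illustrative
class CypherResult(BaseModel):
    query: str
    resource_types: list[str]

# A well-formed response validates into typed fields, no prose parsing needed
ok = CypherResult.model_validate(
    {"query": "MATCH (b:S3Bucket) RETURN b", "resource_types": ["S3Bucket"]}
)

# A response with a misspelled or missing field is rejected deterministically
try:
    CypherResult.model_validate({"qeury": "MATCH (b) RETURN b"})
except ValidationError as e:
    print("rejected:", e.error_count(), "schema errors")
```

Every test against this boundary is now an ordinary, fast unit test on plain Python objects.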
We also ground the context. For our policy pipeline, we explicitly inject valid resource types from the user's infrastructure before generating any Cypher. We can fully test this injection, and then test that the agent only returns these valid resources. This cuts the hallucination space from 'any possible resource type' to only what is actually relevant. This doesn't fix semantic errors (more on that later), but it does reduce the possible errors.
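The two testable halves of that grounding look roughly like this (function names are ours for illustration, not a library API): one function builds the injected context, the other flags anything the agent returns outside the injected set.

```python
def build_prompt_context(valid_resource_types: set[str]) -> str:
    # Inject the user's real resource types up front, before any generation
    return "Valid resource types: " + ", ".join(sorted(valid_resource_types))

def hallucinated_types(returned: set[str], valid_resource_types: set[str]) -> set[str]:
    # Anything outside the injected set is, by definition, a hallucination
    return returned - valid_resource_types
```

Both sides are pure functions, so both sides are cheap unit tests.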
Protip: Set the temperature to 0. This doesn’t eliminate all variance, but it sure does reduce it.
1st law of agents: trust no output
You're going to need validation steps to make sure that the live agent's output is doing what you expect it to. If you don't have these guardrails - add them. These are fully testable, and testing the guardrails themselves doesn't depend on the agent's output.
In our policy workflow, when we generate a Cypher query we check it with two steps.
Regex pattern matching: This is a library of known bad patterns. These are fed to the agent at the generation stage, as well as being part of our RAG, but it's worth checking again before we let any bad code escape
Live execution: We run the generated query against a minimal in-memory database seeded with the correct field names
This catches real syntax errors, type mismatches, and poisoning behaviour before they get to an actual db. We can also check terraform structural validation, have retries, and timeouts. These are all normal python functions and can be tested as such.
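The regex half of that pipeline can be sketched as a plain function (the patterns below are illustrative; a real library would be much larger and tuned to your schema):

```python
import re

# Illustrative library of known-bad Cypher patterns
BAD_PATTERNS = [
    (re.compile(r"\b(DETACH\s+)?DELETE\b", re.IGNORECASE), "destructive clause"),
    (re.compile(r"\bCALL\s+apoc\.", re.IGNORECASE), "arbitrary procedure call"),
    (re.compile(r";\s*\S"), "query stacking"),
]

def lint_cypher(query: str) -> list[str]:
    """Return every reason the generated query should be rejected (empty list = pass)."""
    return [reason for pattern, reason in BAD_PATTERNS if pattern.search(query)]
```

Because this is ordinary deterministic code, its test suite never has to touch a model.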
If your guardrails work, even if your agent misbehaves sometimes, the user will not experience it and you’ll have no disruptions.
2nd law of agents: the data must flow
Most of your agent’s logic is actually deterministic. Routing logic, state transitions, and error propagation are all core flow that you can test by mocking the LLM’s response. This is not testing the LLM, but how you handle the response.
In LangGraph you can write node-level unit tests to make sure the logic is correct across a variety of responses, test interrupt behaviour, and also run multi-turn simulations with pre-built messages and responses. These are chat-level integration tests.
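A minimal sketch of that approach, with state and node names invented for illustration (this is the deterministic logic you would hang off a LangGraph conditional edge, not LangGraph's own API):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AgentState:
    query: str = ""
    error: Optional[str] = None
    retries: int = 0

def route_after_generation(state: AgentState) -> str:
    """Pure routing logic: fully testable without ever calling a model."""
    if state.error is None:
        return "validate"
    return "retry" if state.retries < 3 else "fail"

# "Mock" the LLM by constructing the state it would have produced:
assert route_after_generation(AgentState(query="MATCH (n) RETURN n")) == "validate"
assert route_after_generation(AgentState(error="syntax", retries=1)) == "retry"
assert route_after_generation(AgentState(error="syntax", retries=3)) == "fail"
```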
3rd law of agents: judge, jury, and executioner
Sometimes the code is syntactically valid and passes all your guardrails, but still does the wrong thing.
In this case, you can use a higher quality model in your testing suite to check the output. This is known as LLM-as-judge which sounds absolutely mental, but it does work. Both Microsoft and IBM have created production tooling around this concept.
Here we use a second, structured call to a different model to evaluate the quality of the output.
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_google_genai import ChatGoogleGenerativeAI
from pydantic import BaseModel


class PolicyJudgement(BaseModel):
    logic_is_correct: bool  # Does it find violating resources, not compliant ones?
    uses_correct_resource_type: bool
    uses_correct_attribute: bool
    reasoning: str


async def judge_policy_translation(
    policy_intent: str,
    generated_cypher: str,
    resource_context: str,
) -> PolicyJudgement:
    judge = ChatGoogleGenerativeAI(model="gemini-2.5-pro", temperature=0)
    return await judge.with_structured_output(PolicyJudgement).ainvoke([
        SystemMessage(POLICY_JUDGE_PROMPT),
        HumanMessage(f"""
Policy intent: {policy_intent}
Resource types available: {resource_context}
Generated Cypher: {generated_cypher}

Evaluate whether this Cypher correctly identifies VIOLATING resources
(not compliant ones) for the stated policy.
"""),
    ])

More pro tips:
Always get a reason from the judge so you can debug it
True/False scoring will give you better results; LLMs aren't great at fine-grained numbers
LLMs normally prefer longer answers so put checks in place for verbosity
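The tips above can be combined into one acceptance function. A sketch, assuming a judgement object shaped like `PolicyJudgement` (the function name and the token cap are illustrative):

```python
from types import SimpleNamespace

def accept_judgement(judgement, generated_cypher: str, max_tokens: int = 120) -> bool:
    """Boolean checks only, surface the reasoning on failure,
    and guard against verbosity as a separate deterministic check."""
    passed = (
        judgement.logic_is_correct
        and judgement.uses_correct_resource_type
        and judgement.uses_correct_attribute
    )
    if not passed:
        print(f"judge rejected: {judgement.reasoning}")  # the reasoning is your debug trail
    return passed and len(generated_cypher.split()) <= max_tokens

# SimpleNamespace stands in for a real PolicyJudgement in a unit test:
verdict = SimpleNamespace(
    logic_is_correct=True,
    uses_correct_resource_type=True,
    uses_correct_attribute=False,
    reasoning="query matches compliant resources, not violating ones",
)
assert accept_judgement(verdict, "MATCH (b:S3Bucket) RETURN b") is False
```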
4th law of agents: quest for the golden dataset
Every time your agent performs a tricky action or a complex workflow, you should log it and use it to build up a golden dataset. This is a collection of annotated input/output test cases that we can put straight into our testing suite to make sure behaviour doesn't degrade with future changes or additions.
This is a labelled end-to-end test which hits every part of your stack, including real LLM behaviour. By far the best data comes from real production traffic - build user feedback directly into your agents so you can track good and bad responses to add to your test cases. Over time, your dataset will become invaluable for stopping model regressions from reaching production.
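A sketch of what replaying golden cases looks like; the case schema here is invented for illustration, and in practice the entries come from logged, annotated production traffic:

```python
import json

GOLDEN_CASES = json.loads("""
[
  {"name": "public-s3",
   "intent": "flag S3 buckets with public read access",
   "expected_resource_types": ["S3Bucket"]},
  {"name": "open-ingress",
   "intent": "flag security groups allowing 0.0.0.0/0",
   "expected_resource_types": ["SecurityGroup"]}
]
""")

def check_case(case: dict, observed_types: list[str]) -> bool:
    """One regression assertion: the agent must still target the annotated types."""
    return sorted(observed_types) == sorted(case["expected_resource_types"])

# In a real suite, observed_types comes from running the live pipeline on
# case["intent"] -- e.g. via pytest.mark.parametrize over GOLDEN_CASES.
assert check_case(GOLDEN_CASES[0], ["S3Bucket"])
assert not check_case(GOLDEN_CASES[1], ["S3Bucket"])
```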
Testing builds trust
Agents are famously flaky. We can reduce their exposure to the rest of the system, and then layer these non-deterministic pathways over one another to reduce the resulting noise. This comprehensive testing increases trust in the system, but it begs the question - is it really deterministic?

