Basics of AI Evals: Why They're a Must Now
Have you ever asked an AI chatbot, "What's the best way to learn Python?" and gotten a different answer each time?

What are AI Evals? Why do we need them?
Have you ever asked an AI chatbot, "What's the best way to learn Python?" and gotten a different answer each time?
One response suggests starting with Django tutorials, another recommends data science notebooks, and a third emphasises fundamentals through Project Euler. All three answers are helpful, well-structured, and... completely different.
This ‘Non-deterministic’ nature of AI is the crucial difference when we are looking to unit test the same AI.
Deterministic vs. Non-Deterministic: Why Unit Tests Fall Short
Traditional unit tests are built for deterministic systems, i.e. traditional software, where a single input always results in the same expected output as shown below:
def test_add():
assert add(2, 3) == 5 # Always true or false
Whereas LLMs are fundamentally non-deterministic.
The same prompt can generate multiple valid responses, each with different wording, structure, and emphasis. You can't write assert response == "expected output" when there are dozens of correct answers. You can't check for an exact JSON structure when the model might format things differently each time. The probabilistic, creative nature of LLMs breaks the core assumption unit tests rely on: the ability to assert against a repeated, expected output given the same input.
That's where AI evals come in.
What are AI Evals?
AI evals (short for evaluations) are systematic methods for measuring how well an LLM performs on specific tasks. Think of them as the testing infrastructure for AI systems—but instead of asserting that add(2, 3) == 5, you're evaluating whether a model can accurately summarise a legal document, generate helpful code suggestions, or follow safety guidelines.
At their core, evals consist of three components:
1. Test cases: A dataset of inputs (prompts) paired with expected outputs or success criteria. For example, a customer support eval might include real customer questions with gold-standard responses.
2. Evaluation logic: Code that determines whether the model's output is "correct." This can range from exact string matching (simple but brittle) to using another LLM as a judge (flexible but introduces another layer of uncertainty).
3. Metrics: Quantitative measures that aggregate performance across many test cases. Common metrics include accuracy, precision/recall, or task-specific scores like BLEU for translation or Rouge for summarisation.
Types of Evals
There are fundamentally two approaches to evaluating LLM outputs:
1. Traditional Automated Methods
When your task has objective, measurable criteria, you can use deterministic checks:
Classification tasks: If your LLM outputs discrete categories (sentiment analysis, content moderation, intent classification), you can measure accuracy directly.
{
"prompt": "This product exceeded my expectations!",
"expected": "positive",
"model_output": "positive",
"correct": true
}
Structural validation: Check if the output matches expected formats (valid JSON, required fields present, length constraints, specific keywords or patterns).
Factual checks: When you have a ground truth database, you can verify factual claims programmatically.
These methods are fast, cheap, and deterministic—but they only work for constrained tasks where "correctness" is objective.
2. LLM-as-Judge
For open-ended generation tasks—writing explanations, crafting emails, answering nuanced questions—automated checks fall short. The solution? Use another LLM to evaluate the output.
judge_prompt = f""" Rate this customer support response on a scale of 1-5: Customer question: {question} Agent response: {response} Criteria: - Does it accurately address the question? - Is the tone professional and helpful? - Is it concise without being terse? Provide a score and a brief justification. """
LLM judges can assess subjective qualities like helpfulness, tone, clarity, and completeness. They can even check for subtle issues like hallucinations by comparing the response against source documents.
The catch: LLM judges have biases. They favour longer responses, prefer certain writing styles, and can be inconsistent. Always validate your judge against human ratings on a subset of your eval set, and be explicit about your evaluation criteria in the judge prompt.
Building an Eval Pipeline
Here's a pragmatic approach to getting started:
Start with a golden dataset
Collect 50-100 representative examples from your actual use case. Include edge cases, common scenarios, and known failure modes. Quality trumps quantity—a carefully curated small set beats a large noisy dataset.
Define clear success criteria
Be specific about what "good" means. "The response should be helpful" is too vague. Better: "The response should correctly answer the question, cite relevant sources if applicable, and be under 200 words."
Automate what you can
Build automated checks for objective criteria:
- Response length constraints
- Required keywords or patterns
- Structural requirements (e.g., must be valid JSON)
- Basic factuality checks against known data
Use a hybrid evaluation
Combine automated metrics with human review. Run automated evals on every change, but periodically have humans rate a sample of outputs. This catches issues that automated metrics miss while keeping the process scalable.
Version everything
Track:
- Model versions (e.g., gpt-4-turbo vs claude-3-opus)
- Prompt templates
- System prompts and few-shot examples
- Eval datasets (they should evolve too)
- Evaluation criteria
Practical Tools
The ecosystem is rapidly maturing. Popular frameworks include:
- OpenAI Evals: Open-source framework with a registry of standard evals
- Weights & Biases: Experiment tracking with LLM-specific features
- LangSmith: End-to-end LLM app development with built-in evaluation
- Anthropic's Claude eval library: Specialised tools for evaluating Claude models
- Braintrust: Production-ready eval and observability platform
Most teams also build custom tooling tailored to their specific use case.
What else is different?
We've covered the obvious challenge: LLMs don't produce deterministic outputs like traditional functions do. But that's just the beginning.
What other fundamental differences do you see between unit tests and AI evals? How does testing for "helpfulness" compare to testing for "correctness"? When your test suite uses an LLM to judge another LLM's output, what new problems does that create?
These aren't just theoretical questions—the answers shape how we build, ship, and maintain AI systems in production.
What's your experience with evaluating LLM outputs? Share your thoughts in the comments.


