Basics of AI Evals: Why They're a Must Now

JAVA AI

API

Basics of AI Evals: Why They're a Must Now

Have you ever asked an AI chatbot, "What's the best way to learn Python?" and gotten a different answer each time?

Posted by

Adharsh Dhandapani

November 5, 2025

What are AI Evals? Why do we need them?

‍

Have you ever asked an AI chatbot, "What's the best way to learn Python?" and gotten a different answer each time?

One response suggests starting with Django tutorials, another recommends data science notebooks, and a third emphasises fundamentals through Project Euler. All three answers are helpful, well-structured, and... completely different.

This ‘Non-deterministic’ nature of AI is the crucial difference when we are looking to unit test the same AI.

‍

Deterministic vs. Non-Deterministic: Why Unit Tests Fall Short

‍

Traditional unit tests are built for deterministic systems, i.e. traditional software, where a single input always results in the same expected output as shown below:

‍

def test_add():

assert add(2, 3) == 5 # Always true or false

‍

Whereas LLMs are fundamentally non-deterministic.

The same prompt can generate multiple valid responses, each with different wording, structure, and emphasis. You can't write assert response == "expected output" when there are dozens of correct answers. You can't check for an exact JSON structure when the model might format things differently each time. The probabilistic, creative nature of LLMs breaks the core assumption unit tests rely on: the ability to assert against a repeated, expected output given the same input.

That's where AI evals come in.

‍

What are AI Evals?

‍

AI evals (short for evaluations) are systematic methods for measuring how well an LLM performs on specific tasks. Think of them as the testing infrastructure for AI systems—but instead of asserting that add(2, 3) == 5, you're evaluating whether a model can accurately summarise a legal document, generate helpful code suggestions, or follow safety guidelines.

At their core, evals consist of three components:

1. Test cases: A dataset of inputs (prompts) paired with expected outputs or success criteria. For example, a customer support eval might include real customer questions with gold-standard responses.

2. Evaluation logic: Code that determines whether the model's output is "correct." This can range from exact string matching (simple but brittle) to using another LLM as a judge (flexible but introduces another layer of uncertainty).

3. Metrics: Quantitative measures that aggregate performance across many test cases. Common metrics include accuracy, precision/recall, or task-specific scores like BLEU for translation or Rouge for summarisation.

‍

Types of Evals

‍

There are fundamentally two approaches to evaluating LLM outputs:

‍

1. Traditional Automated Methods

‍

When your task has objective, measurable criteria, you can use deterministic checks:

‍

Classification tasks: If your LLM outputs discrete categories (sentiment analysis, content moderation, intent classification), you can measure accuracy directly.

‍

{ 
  "prompt": "This product exceeded my expectations!", 
  "expected": "positive", 
  "model_output": "positive", 
  "correct": true 
}

‍

‍Structural validation: Check if the output matches expected formats (valid JSON, required fields present, length constraints, specific keywords or patterns).

‍

Factual checks: When you have a ground truth database, you can verify factual claims programmatically.

‍

These methods are fast, cheap, and deterministic—but they only work for constrained tasks where "correctness" is objective.

‍

2. LLM-as-Judge

‍

For open-ended generation tasks—writing explanations, crafting emails, answering nuanced questions—automated checks fall short. The solution? Use another LLM to evaluate the output.

judge_prompt = f""" Rate this customer support response on a scale of 1-5: Customer question: {question} Agent response: {response} Criteria: - Does it accurately address the question? - Is the tone professional and helpful? - Is it concise without being terse? Provide a score and a brief justification. """

LLM judges can assess subjective qualities like helpfulness, tone, clarity, and completeness. They can even check for subtle issues like hallucinations by comparing the response against source documents.

The catch: LLM judges have biases. They favour longer responses, prefer certain writing styles, and can be inconsistent. Always validate your judge against human ratings on a subset of your eval set, and be explicit about your evaluation criteria in the judge prompt.

‍

Building an Eval Pipeline

‍

Here's a pragmatic approach to getting started:

‍

Start with a golden dataset

‍

Collect 50-100 representative examples from your actual use case. Include edge cases, common scenarios, and known failure modes. Quality trumps quantity—a carefully curated small set beats a large noisy dataset.

‍

Define clear success criteria

‍

Be specific about what "good" means. "The response should be helpful" is too vague. Better: "The response should correctly answer the question, cite relevant sources if applicable, and be under 200 words."

‍

Automate what you can

‍

Build automated checks for objective criteria:

Response length constraints
Required keywords or patterns
Structural requirements (e.g., must be valid JSON)
Basic factuality checks against known data

‍

Use a hybrid evaluation

‍

Combine automated metrics with human review. Run automated evals on every change, but periodically have humans rate a sample of outputs. This catches issues that automated metrics miss while keeping the process scalable.

‍

Version everything

‍

Track:

Model versions (e.g., gpt-4-turbo vs claude-3-opus)
Prompt templates
System prompts and few-shot examples
Eval datasets (they should evolve too)
Evaluation criteria

‍

Practical Tools

‍

The ecosystem is rapidly maturing. Popular frameworks include:

OpenAI Evals: Open-source framework with a registry of standard evals
Weights & Biases: Experiment tracking with LLM-specific features
LangSmith: End-to-end LLM app development with built-in evaluation
Anthropic's Claude eval library: Specialised tools for evaluating Claude models
Braintrust: Production-ready eval and observability platform

Most teams also build custom tooling tailored to their specific use case.

What else is different?

‍

We've covered the obvious challenge: LLMs don't produce deterministic outputs like traditional functions do. But that's just the beginning.

What other fundamental differences do you see between unit tests and AI evals? How does testing for "helpfulness" compare to testing for "correctness"? When your test suite uses an LLM to judge another LLM's output, what new problems does that create?

These aren't just theoretical questions—the answers shape how we build, ship, and maintain AI systems in production.

‍

What's your experience with evaluating LLM outputs? Share your thoughts in the comments.

‍

Basics of AI Evals: Why They're a Must Now

Adharsh Dhandapani

What are AI Evals? Why do we need them?

Deterministic vs. Non-Deterministic: Why Unit Tests Fall Short

What are AI Evals?

Types of Evals

1. Traditional Automated Methods

2. LLM-as-Judge

Building an Eval Pipeline

Start with a golden dataset

Define clear success criteria

Automate what you can

Use a hybrid evaluation

Version everything

Practical Tools

What else is different?

Ready to transform your business?

Let's build the future together.