Data Herding

[AI-Evals] Evaluating LLM applications


If you’ve built an AI application, you’ve probably experienced that moment of uncertainty: “Is this actually working?” So you test a few prompts, things look good, but how do you know if it’s actually ready for production?

Based on insights from Hamel Husain and Shreya Shankar’s popular AI Evals course and my own experience building evaluations for AI apps at my day job, I wrote up this guide to show you how to build evaluation systems for your AI application.

Evals Are A Necessity

While building an AI application, I’ve spent at least 40-50% of my time on manual error analysis and evaluation. Manually looking through your AI output traces is not an afterthought. That might sound excessive, but it’s important to know whether the latest prompt change broke something, whether your model is hallucinating on edge cases, or whether users are quietly abandoning your product due to unreliable outputs.

The course identifies three critical “gulfs” in AI development:

  • Gulf of Comprehension: Understanding your data (go through the inputs and output traces!)
  • Gulf of Specification: Configure the prompt carefully with the exact output you desire; LLMs aren’t humans and have varying performance across inputs
  • Gulf of Generalization: Make sure the LLM does well beyond the data it was trained on by identifying the right technique (context engineering, RAG, breaking instructions into multiple steps) so that the LLM generalizes well enough

Start with Error Analysis, Not Metrics

When I first built AI apps, my mistake was jumping straight into metrics. While metrics like relevancy and hallucination scores were useful, it took me a while to identify root causes. I found that manually reviewing real user traces did far more to uncover upstream issues and patterns. Block 30 minutes on your calendar, pull 10 real user interactions, and write down what went right and what went wrong.

This manual review accomplishes two things:

  1. It grounds you in actual user problems rather than imagined ones
  2. It helps identify failure modes that generic metrics would miss
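A lightweight way to make this a habit is to pull a reproducible random sample of traces from your logs for annotation. The sketch below is a minimal version; the trace fields and in-memory list are assumptions standing in for whatever logging store you actually use.

```python
import random

# Hypothetical trace records; in practice these would come from your
# logging database or observability tool. The field names are assumptions.
traces = [
    {"id": i, "input": f"user question {i}", "output": f"model answer {i}"}
    for i in range(200)
]

def sample_for_review(traces, n=10, seed=42):
    """Draw a reproducible random sample of traces for manual annotation."""
    rng = random.Random(seed)
    return rng.sample(traces, n)

# Review each sampled trace and write open-ended notes, not just a score.
for t in sample_for_review(traces):
    print(t["id"], t["input"], "->", t["output"])
```

Fixing the seed means teammates reviewing the same batch see the same traces, which makes it easier to compare notes.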

Building Your Eval Pipeline Without Making It a Bottleneck

The key is to integrate evals into your development workflow, not treat them as a separate phase. Here’s a practical approach:

1. Define Pass/Fail Criteria That Matter

Generic metrics like BERTScore, ROUGE, and cosine similarity are not useful for evaluating LLM outputs in most AI applications. Instead, design binary pass/fail evals using LLM-as-judge or code-based assertions. Let’s take the example of building an AI-powered real estate CRM assistant. Here’s what might be useful to measure:

  • ✅ Do measure: “Does the system suggest only available showings?” (code assertion)
  • ✅ Do measure: “Does the response avoid confusing client personas?” (LLM-as-judge)
  • ❌ Don’t measure: Generic text similarity scores
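The showings check above is a good candidate for a pure code assertion, since availability is verifiable against the calendar. Here is a minimal sketch; the slot format and how you extract suggested slots from the model output are assumptions for this hypothetical CRM.

```python
def showings_are_available(suggested_slots, available_slots):
    """Binary pass/fail: every showing the assistant suggests must
    actually exist in the agent's calendar."""
    return set(suggested_slots) <= set(available_slots)

# Hypothetical calendar data for the example.
available = {"2024-06-01 10:00", "2024-06-01 14:00"}

showings_are_available(["2024-06-01 10:00"], available)   # passes
showings_are_available(["2024-06-02 09:00"], available)   # fails
```

Because the check is deterministic code, it costs nothing to run on every trace and never drifts the way an LLM judge can.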

2. Use LLM-as-Judge Thoughtfully

LLM-as-judge approaches can effectively evaluate outputs when properly validated against human judgments. The trick is alignment: go through a bunch of traces manually and label them yourself. This might seem cumbersome, but it will make your LLM judge a better evaluator.
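One simple way to quantify that alignment is to compare the judge’s labels against your hand labels on the same traces, and look separately at agreement on human-labeled failures (the cases you most need the judge to catch). A minimal sketch, with made-up labels:

```python
def judge_alignment(human_labels, judge_labels):
    """Overall agreement between an LLM judge and human labels, plus
    agreement restricted to traces the human marked as failures."""
    pairs = list(zip(human_labels, judge_labels))
    overall = sum(h == j for h, j in pairs) / len(pairs)
    on_fails = [h == j for h, j in pairs if h == "fail"]
    fail_agreement = sum(on_fails) / len(on_fails) if on_fails else None
    return overall, fail_agreement

# Hypothetical labels from a manual review session.
human = ["pass", "fail", "pass", "fail", "pass"]
judge = ["pass", "fail", "pass", "pass", "pass"]
overall, fail_agreement = judge_alignment(human, judge)
# overall = 0.8, but agreement on human-labeled failures is only 0.5
```

The breakdown matters: a judge that rubber-stamps everything as “pass” can look well-aligned overall while missing exactly the failures you built it to find.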

3. Scale Your Testing with Synthetic Data

When you don’t have enough real examples, synthetic data fills the gaps. Synthetic data scales fast (you can easily generate thousands of test cases), fills gaps by adding missing scenarios and edge cases, and allows controlled testing to see how AI handles specific challenges.

The process typically involves two stages: context generation (selecting relevant chunks from your knowledge base) and input generation (creating questions/queries from those contexts). This reverses standard retrieval - instead of finding contexts from inputs, you create inputs from predefined contexts.
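The two stages above can be sketched as follows. The `generate_query` function stands in for an LLM call; here it is a template stub so the example runs offline, and the knowledge-base contents are made up for illustration.

```python
# Hypothetical knowledge base chunks for a real estate assistant.
knowledge_base = [
    "Listings in Springfield must disclose HOA fees.",
    "Showings require 24 hours notice from the seller's agent.",
]

def select_contexts(kb, k=2):
    """Stage 1 (context generation): pick chunks to ground each test case."""
    return kb[:k]

def generate_query(context):
    """Stage 2 (input generation): in practice, prompt an LLM with something
    like 'Write a question a client might ask that this passage answers.'
    This stub just templates the context so the sketch is runnable."""
    return f"What does this policy mean for me? {context}"

synthetic_cases = [
    {"context": c, "input": generate_query(c)}
    for c in select_contexts(knowledge_base)
]
```

Keeping the grounding context attached to each generated input is what lets you later score retrieval: you already know which chunk the retriever should have found.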

Key Metrics That Actually Matter

While every application needs custom metrics, here are the essential categories to consider:

For RAG Systems

RAG metrics measure either the retriever or generator in isolation. Retriever metrics include contextual recall, precision, and relevancy for evaluating things like top-K values and embedding models. Generator metrics include faithfulness and answer relevancy for evaluating the LLM and prompt template.
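Contextual precision and recall for the retriever reduce to simple set arithmetic over chunk IDs, given labeled relevant chunks per query. A minimal sketch with hypothetical IDs:

```python
def contextual_precision(retrieved, relevant):
    """Fraction of retrieved chunks that are actually relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for c in retrieved if c in relevant) / len(retrieved)

def contextual_recall(retrieved, relevant):
    """Fraction of the relevant chunks that the retriever found."""
    if not relevant:
        return 0.0
    return sum(1 for c in relevant if c in retrieved) / len(relevant)

# Hypothetical top-3 retrieval for one query.
retrieved = ["chunk_1", "chunk_2", "chunk_3"]
relevant = {"chunk_1", "chunk_4"}
# precision = 1/3, recall = 1/2
```

Sweeping top-K and re-plotting these two numbers is usually the fastest way to see whether your embedding model or your K value is the bottleneck.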

Core Universal Metrics

  • Hallucination: Determines whether an LLM output contains fake or made-up information
  • Answer Relevancy: Measures how well the response addresses the input in an informative and concise manner
  • Task Completion: Whether the system achieves its intended goal

The ROUGE Score Reality Check

Research shows ROUGE exhibits alarmingly low precision for identifying actual factual errors. These overlap-based metrics systematically overestimate hallucination detection performance in QA, leading to illusory progress. Traditional NLP metrics weren’t designed for generative AI; they might provide useful insights to begin with, but that’s about it.

Categorizing Failure Modes: Manual and Automated Approaches

Understanding why your system fails is as important as knowing that it fails. Start manually:

  1. Manual Categorization: Review failing cases and group them into patterns (e.g., “fails on multi-step reasoning,” “misunderstands temporal queries”)
  2. Automated Categorization: Once you identify patterns, use LLMs to categorize new failures automatically
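Once the categories exist, the automated step can be sketched like this. A production version would prompt an LLM with the failing trace and the category list; this stub uses keyword rules so the example runs offline, and both the categories and keywords are assumptions for illustration.

```python
# Hypothetical failure taxonomy discovered during manual review.
CATEGORIES = {
    "multi-step reasoning": ["first", "then", "step"],
    "temporal queries": ["yesterday", "tomorrow", "next week"],
}

def categorize_failure(trace_text, categories=CATEGORIES):
    """Assign a failing trace to the first matching category.
    In production, replace the keyword match with an LLM call that is
    given the trace plus the category definitions."""
    text = trace_text.lower()
    for name, keywords in categories.items():
        if any(k in text for k in keywords):
            return name
    return "uncategorized"

categorize_failure("User asked about showings next week")
```

Tracking the count per category over time tells you which failure mode to attack first, and whether a fix actually moved the needle.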

Making Evals Part of Your Culture

Well-crafted eval prompts effectively become living product requirements documents that continuously test your AI in real time. Iterate on the application as you get feedback signals from the evaluation pipeline. I’ve found this to be a far better choice than building an entire application with all the AI features embedded, only to course-correct or change requirements after evaluation or user testing. Start small, iterate quickly.

The Path Forward

Building reliable AI applications isn’t about finding the perfect model or crafting the ultimate prompt—it’s about creating a systematic approach to understanding and improving your system’s behavior. Evals are the language of trust. They are the only way to systematically prove that your AI is improving.

Start small:

  1. Manually review 10-20 real interactions today
  2. Identify your top 3 failure modes
  3. Write one simple pass/fail eval for each
  4. Generate 100 synthetic test cases
  5. Run these evals before every deployment
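Step 5 can be wired up as a simple gate: run every eval over the test set and block the deploy if the pass rate drops below a threshold. This is a minimal sketch; the threshold, case structure, and the single code-assertion eval are all assumptions.

```python
def run_evals(test_cases, evals, threshold=0.9):
    """Run all evals over all cases; a case passes only if every eval
    passes. Returns (deploy_ok, pass_rate)."""
    results = [all(e(case) for e in evals) for case in test_cases]
    pass_rate = sum(results) / len(results)
    return pass_rate >= threshold, pass_rate

# Hypothetical cases: one valid suggestion, one unavailable showing.
cases = [
    {"suggested": ["10:00"], "available": {"10:00", "14:00"}},
    {"suggested": ["09:00"], "available": {"10:00"}},
]
evals = [lambda c: set(c["suggested"]) <= set(c["available"])]

deploy_ok, rate = run_evals(cases, evals)
# rate = 0.5, so the deploy is blocked
```

Hooking this into CI means a prompt change that silently regresses a known failure mode never reaches users.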

Remember: nothing beats truly understanding your data! As a data scientist, digging into the data and looking for patterns comes naturally to me. If you’re not from a data background, spending time looking through the data might seem futile, but it will save you loads of time in the long run.

Ready to dive deeper? Check out Hamel and Shreya’s AI Evals course for hands-on training in evaluation-driven development.