Evaluating LLM Outputs

Overview

Evaluating LLMs is fundamentally harder than evaluating traditional ML models. A spam classifier has one correct answer per input. An LLM asked to "write a professional email declining a meeting" has thousands of valid answers. There is no single ground truth to compare against.

This makes LLM evaluation a combination of art and engineering. You need multiple evaluation approaches, each capturing different aspects of quality. Human evaluation is the gold standard but too expensive for every change. Automated metrics are cheap but miss nuance. LLM-as-judge sits in between, offering scalable evaluation with known biases. The practical answer is to use all three, with the right approach for the right situation.

Human Evaluation

The Gold Standard

Human evaluation is the most trustworthy way to assess LLM outputs, but it is expensive and slow, so make each review count by giving reviewers specific criteria:

def create_human_evaluation_task(outputs, criteria):
    """Create a structured evaluation task for human reviewers.
    
    Key: give humans specific criteria, not vague instructions.
    'Rate this response 1-5' produces useless data.
    'Is the response factually correct? Yes/No/Partially' 
    produces useful data.
    """
    evaluation_tasks = []
    
    for output in outputs:
        task = {
            "prompt": output["prompt"],
            "response": output["response"],
            "ratings": {},
        }
        
        for criterion in criteria:
            task["ratings"][criterion["name"]] = {
                "description": criterion["description"],
                "scale": criterion["scale"],
                "value": None,  # Human fills this in
            }
        
        evaluation_tasks.append(task)
    
    return evaluation_tasks

# Example criteria
criteria = [
    {
        "name": "factual_accuracy",
        "description": "Are all claims in the response verifiably true?",
        "scale": ["all_correct", "mostly_correct", "some_errors", "major_errors"],
    },
    {
        "name": "completeness",
        "description": "Does the response address all parts of the prompt?",
        "scale": ["complete", "mostly_complete", "partial", "incomplete"],
    },
    {
        "name": "helpfulness",
        "description": "Would this response help the user accomplish their goal?",
        "scale": ["very_helpful", "somewhat_helpful", "not_helpful", "harmful"],
    },
    {
        "name": "tone",
        "description": "Is the tone appropriate for the context?",
        "scale": ["appropriate", "slightly_off", "inappropriate"],
    },
]
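
Once reviewers have filled in the `value` fields, the tasks can be summarized per criterion. The sketch below is one possible way to do that, assuming the task structure returned by `create_human_evaluation_task`. The helper name `summarize_ratings` and the decision to reject off-scale values are assumptions made for illustration.

```python
from collections import Counter


def summarize_ratings(completed_tasks):
    """Count how often each scale value was chosen, per criterion.

    Hypothetical helper: expects tasks shaped like the output of
    create_human_evaluation_task, with every "value" filled in.
    """
    summary = {}
    for task in completed_tasks:
        for name, rating in task["ratings"].items():
            value = rating["value"]
            if value not in rating["scale"]:
                # Off-scale or missing ratings usually mean a tooling or
                # annotator error, so fail loudly instead of counting them.
                raise ValueError(f"{name}: {value!r} is not on the scale")
            summary.setdefault(name, Counter())[value] += 1
    return summary


# Two hand-written completed tasks for illustration (not real data)
scale = ["all_correct", "mostly_correct", "some_errors", "major_errors"]
completed = [
    {"ratings": {"factual_accuracy": {"scale": scale, "value": "all_correct"}}},
    {"ratings": {"factual_accuracy": {"scale": scale, "value": "some_errors"}}},
]
summary = summarize_ratings(completed)
```

Per-criterion counts like these make it easy to see which dimension is failing, which a single overall score would hide.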

Pairwise Comparison

Pairwise comparison is easier for humans than absolute scoring. Show two outputs side by side and ask which is better:

import random

def pairwise_evaluation(prompt, response_a, response_b):
    """Ask humans to compare two responses directly.
    
    Pairwise comparison is more reliable than absolute scoring
    because humans are better at relative judgments than
    calibrating a 1-5 scale consistently.
    """
    # Randomize display order so neither system always appears first
    swapped = random.random() < 0.5
    first, second = (response_b, response_a) if swapped else (response_a, response_b)
    task = {
        "prompt": prompt,
        "response_a": first,
        "response_b": second,
        "swapped": swapped,  # Hide from raters; use it to map answers back
        "question": "Which response is better overall?",
        "options": ["A is better", "B is better", "About the same"],
        "follow_up": "Briefly explain why.",
    }
    return task
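
Collected pairwise verdicts are usually reduced to a win rate. The following is a minimal sketch of that step, not a prescribed method. It assumes each completed task stores the rater's chosen option under a hypothetical `verdict` key and, if the display order was shuffled, a hypothetical `swapped` flag that is True when system A was shown in the "B" slot.

```python
def win_rate(judgments):
    """Return the share of decisive comparisons won by system A.

    Each judgment is a dict with a "verdict" (one of the task options)
    and an optional "swapped" flag. Both key names are assumptions of
    this sketch.
    """
    wins_a = wins_b = 0
    for j in judgments:
        verdict = j["verdict"]
        if verdict == "About the same":
            continue  # Ties carry no preference; skip them
        a_slot_won = verdict == "A is better"
        # If the display order was swapped, the "A" slot held system B
        system_a_won = a_slot_won != j.get("swapped", False)
        if system_a_won:
            wins_a += 1
        else:
            wins_b += 1
    decisive = wins_a + wins_b
    return wins_a / decisive if decisive else None


judgments = [
    {"verdict": "A is better", "swapped": False},  # system A wins
    {"verdict": "A is better", "swapped": True},   # system B wins
    {"verdict": "B is better", "swapped": True},   # system A wins
    {"verdict": "About the same", "swapped": False},
]
rate = win_rate(judgments)
```

Excluding ties from the denominator is a design choice; counting each tie as half a win for both sides is an equally common convention.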

Cost and Scale

Human evaluation costs (approximate):

Absolute rating (per-criterion scale):
  Per example:    $0.50-2.00
  100 examples:   $50-200
  Turnaround:     1-3 days
  Best for:       Final quality checks, benchmark creation

Pairwise comparison:
  Per pair:       $0.30-1.00
  100 pairs:      $30-100
  Turnaround:     1-2 days
  Best for:       Comparing model versions, A/B decisions

Expert evaluation (domain-specific):
  Per example:    $5-50
  100 examples:   $500-5,000
  Turnaround:     1-2 weeks
  Best for:       Medical, legal, technical accuracy
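
The per-item ranges above translate directly into a budget estimate. The sketch below only performs that arithmetic. The function name `estimate_cost` and the `raters_per_item` parameter are assumptions added for illustration, on the premise that collecting several independent ratings per item multiplies the cost.

```python
# Per-item cost ranges in USD, copied from the figures above
COST_PER_ITEM = {
    "absolute": (0.50, 2.00),
    "pairwise": (0.30, 1.00),
    "expert": (5.00, 50.00),
}


def estimate_cost(method, n_items, raters_per_item=1):
    """Return a (low, high) budget range in USD.

    raters_per_item is an assumption of this sketch, not a figure
    from the cost table.
    """
    low, high = COST_PER_ITEM[method]
    total_ratings = n_items * raters_per_item
    return (low * total_ratings, high * total_ratings)


pairwise_budget = estimate_cost("pairwise", 100)
expert_budget = estimate_cost("expert", 100, raters_per_item=3)
```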

LLM-as-Judge

Using a Strong Model to Evaluate a Weaker One

LLM-as-judge is fast and cheap, but it introduces its own biases:

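import json

from openai import OpenAI

client = OpenAI()  # Reads OPENAI_API_KEY from the environment

def llm_as_judge(prompt, response, criteria):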
    """Use GPT-4o to evaluate another model's response.
    
    Biases to be aware of:
    - Prefers longer responses (verbosity bias)
    - Prefers its own style (self-preference bias)
    - Prefers responses listed first (position bias)
    - Struggles with factual verification
    """
    evaluation_prompt = f"""Evaluate the following response to a user prompt.

User prompt: {prompt}

Response to evaluate: {response}

Evaluate on these criteria:
{json.dumps(criteria, indent=2)}

For each criterion, provide:
1. A rating (from the provided scale)
2. A brief justification (1-2 sentences)

Return your evaluation as JSON."""

    result = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "You are an expert evaluator. Be critical and specific. "
                           "Do not default to positive ratings.",
            },
            {"role": "user", "content": evaluation_prompt},
        ],
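        temperature=0,
        # JSON mode: without it, json.loads below can fail on prose or fenced output
        response_format={"type": "json_object"},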
    )
    
    return json.loads(result.choices[0].message.content)

Mitigating Judge Biases

def pairwise_llm_judge_with_debiasing(prompt, response_a, response_b):
    """Compare two responses with position bias mitigation.
    
    Run the comparison twice with the order swapped.
    If the judge picks the same response both times,
    the preference is likely real. If it flips, 
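    the preference is likely position bias.
    
    Assumes call_judge(prompt_text) sends the prompt to the judge
    model and returns the parsed JSON verdict as a dict.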
    """
    judge_prompt_template = """Compare these two responses to the user prompt.

User prompt: {prompt}

Response A: {first}

Response B: {second}

Which response is better? Consider accuracy, helpfulness, 
and clarity. Output JSON: {{"winner": "A" or "B" or "tie", 
"reason": "brief explanation"}}"""

    # Run 1: A first, B second
    result_1 = call_judge(judge_prompt_template.format(
        prompt=prompt, first=response_a, second=response_b
    ))
    
    # Run 2: B first, A second (swap positions)
    result_2 = call_judge(judge_prompt_template.format(
        prompt=prompt, first=response_b, second=response_a
    ))
    
    # Reconcile results
    winner_1 = result_1["winner"]
    winner_2 = "B" if result_2["winner"] == "A" else (
        "A" if result_2["winner"] == "B" else "tie"
    )
    
    if winner_1 == winner_2:
        return {"winner": winner_1, "confidence": "high"}
    else:
        return {"winner": "tie", "confidence": "low (position bias detected)"}
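
The reconciliation step can be exercised without any API calls by substituting deterministic stand-in judges. The sketch below mirrors the swap-and-compare logic. `reconcile`, `compare`, and both stub judges are hypothetical names introduced here, and the stubs are far cruder than a real judge model.

```python
def reconcile(verdict_ab, verdict_ba):
    """Combine verdicts from the A-first and B-first runs.

    verdict_ba is expressed in display positions, so "A" there refers
    to response B; map it back before comparing.
    """
    flipped = {"A": "B", "B": "A", "tie": "tie"}[verdict_ba]
    if verdict_ab == flipped:
        return {"winner": verdict_ab, "confidence": "high"}
    return {"winner": "tie", "confidence": "low (position bias detected)"}


def first_position_judge(first, second):
    """Stand-in judge that always prefers whatever is shown first."""
    return "A"


def longer_is_better_judge(first, second):
    """Stand-in judge with a content-based (if crude) preference."""
    if len(first) == len(second):
        return "tie"
    return "A" if len(first) > len(second) else "B"


def compare(judge, response_a, response_b):
    """Run the judge in both orders and reconcile the two verdicts."""
    return reconcile(judge(response_a, response_b), judge(response_b, response_a))


biased = compare(first_position_judge, "short", "a much longer answer")
consistent = compare(longer_is_better_judge, "short", "a much longer answer")
```

The position-biased stub flips its answer when the order is swapped, so it is reported as a low-confidence tie. The length-based stub picks response B in both runs and is reported with high confidence, which also shows that this check does nothing against verbosity bias.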

When LLM-as-Judge Works vs When It Fails

Works well:
  Style and tone evaluation       (is this professional?)
  Format compliance               (did it follow the template?)
  Instruction following           (did it answer the question?)
  Relative comparison             (which of these two is better?)
  Coherence and fluency           (does this read well?)

Works poorly:
  Factual accuracy                (the judge may not know the facts)
  Domain expertise                (medical, legal, financial claims)
  Numerical reasoning             (math errors pass undetected)
  Subtle hallucinations           (plausible but false statements)
  Cultural nuance                 (judge has its own cultural bias)

Automated Metrics

Traditional NLP Metrics

These metrics compare generated text to reference text. They are fast and deterministic but capture only surface-level similarity:

from rouge_score import rouge_scorer
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

def compute_reference_metrics(generated, reference):
    """Traditional metrics that compare to a reference answer.
    
    Useful for summarization and translation.
    Mostly useless for open-ended generation.
    """
    # BLEU: measures n-gram overlap
    # Good for translation, bad for open-ended tasks
    # Smoothing keeps short texts with no 4-gram match from collapsing to 0
    bleu = sentence_bleu(
        [reference.split()], 
        generated.split(),
        smoothing_function=SmoothingFunction().method1,
    )
    
    # ROUGE: recall-oriented n-gram overlap with the reference
    # (the F-measure is reported below). Good for summarization
    scorer = rouge_scorer.RougeScorer(
        ["rouge1", "rouge2", "rougeL"], use_stemmer=True
    )
    rouge = scorer.score(reference, generated)
    
    return {
        "bleu": bleu,
        "rouge1_f": rouge["rouge1"].fmeasure,
        "rouge2_f": rouge["rouge2"].fmeasure,
        "rougeL_f": rouge["rougeL"].fmeasure,
    }

When to Use Automated Metrics

Metric              Best for                    Limitations
BLEU                Translation                 Penalizes valid paraphrases
ROUGE               Summarization               Ignores factual correctness
Semantic similarity Open-ended comparison       Misses factual errors
Exact match         Structured output (JSON)    Too strict for free text
Pass rate           Code generation             Binary, no partial credit
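
Two of the metrics in the table, exact match and pass rate, are simple enough to sketch directly. The version below is one reasonable interpretation rather than a standard definition: it compares JSON by parsed value instead of raw text, and it treats pass rate as the fraction of boolean test outcomes that are true. Both function names are hypothetical.

```python
import json


def json_exact_match(generated, expected):
    """Compare two JSON strings by parsed value, not by raw text.

    Parsing first means key order and whitespace do not cause false
    mismatches. Unparseable output counts as a miss.
    """
    try:
        return json.loads(generated) == json.loads(expected)
    except json.JSONDecodeError:
        return False


def pass_rate(results):
    """Fraction of boolean test outcomes that passed."""
    return sum(results) / len(results) if results else 0.0


same = json_exact_match('{"a": 1, "b": 2}', '{"b": 2,  "a": 1}')
broken = json_exact_match('{"a": 1', '{"a": 1}')
rate = pass_rate([True, True, False, True])
```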

Task-Specific Evaluation

Beyond general quality, evaluate task-specific dimensions:

Factuality: Break the response into claims, check each against source documents. Essential for RAG systems.
Format compliance: Validate JSON schemas, check required fields. Binary pass/fail.
Safety: Check for PII leakage, harmful content, and whether the model refused when it should have.
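
Format compliance and a basic PII check can both be automated with the standard library. The sketch below is illustrative only: `check_format` and `contains_email` are hypothetical helpers, and the regular expression catches email-like strings only, which is nowhere near a complete PII detector.

```python
import json
import re

# Deliberately simple pattern: catches email-like strings only
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def check_format(response_text, required_fields):
    """Pass only if the response is a JSON object with every required field."""
    try:
        data = json.loads(response_text)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(f in data for f in required_fields)


def contains_email(response_text):
    """Flag responses that include something shaped like an email address."""
    return EMAIL_PATTERN.search(response_text) is not None


ok = check_format('{"answer": "42", "sources": []}', ["answer", "sources"])
missing = check_format('{"answer": "42"}', ["answer", "sources"])
leak = contains_email("Contact jane.doe@example.com for a refund.")
clean = contains_email("Contact our support team for a refund.")
```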

Real-World Example: Evaluating a Customer Support Bot

A team builds an LLM-powered customer support bot. They need to evaluate quality before launch.

Automated metrics (run on every PR): Format compliance (does the response include a greeting and sign-off?), length check (50-500 words), safety check (no PII leakage, no harmful content). These catch obvious failures in seconds.
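
The greeting, sign-off, and length checks described above might look like the following sketch. The accepted greeting and sign-off phrases are hypothetical placeholders, the matching is crude prefix matching, and the 50-500 word bounds come from the example. A real bot would need lists tuned to its own templates.

```python
GREETINGS = ("hi", "hello", "dear")  # hypothetical accepted openers
SIGN_OFFS = ("best regards", "kind regards", "thanks")  # hypothetical closers


def check_support_response(text, min_words=50, max_words=500):
    """Return a dict of named pass/fail checks for one bot response."""
    stripped = text.strip()
    lowered = stripped.lower()
    lines = [ln.strip().lower() for ln in stripped.splitlines() if ln.strip()]
    word_count = len(stripped.split())
    return {
        "has_greeting": lowered.startswith(GREETINGS),
        # Look for the sign-off in the last two non-empty lines,
        # since a name usually follows it
        "has_sign_off": any(ln.startswith(SIGN_OFFS) for ln in lines[-2:]),
        "length_ok": min_words <= word_count <= max_words,
    }


body = " ".join(["We have processed your refund request."] * 10)
good = check_support_response(f"Hello Sam,\n\n{body}\n\nBest regards,\nSupport Team")
bad = check_support_response("Your refund is done.")
```

Returning named checks instead of a single boolean lets a CI job report exactly which rule a response violated.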

LLM-as-judge (run weekly): Rate 500 responses on helpfulness, accuracy, and tone. Compare the bot's performance across different ticket categories. Identify categories where the bot struggles. Cost: approximately $25 per run.

Human evaluation (run before major releases): Domain experts review 100 responses from the bot, focusing on factual accuracy and policy compliance. Pairwise comparison against the previous version. Cost: approximately $2,000 per run.

Result: Automated metrics catch 80% of regressions immediately. LLM-as-judge catches subtle quality issues within a week. Human evaluation validates that the bot meets company standards before launch.

Common Pitfalls

Relying on BLEU or ROUGE for open-ended generation: These metrics measure word overlap, not quality. A response can score zero on BLEU and still be excellent, because a valid paraphrase may share no n-grams with the reference.
Using the same model as both generator and judge: If GPT-4o generates the response and GPT-4o judges it, the evaluation is biased toward that model's style and preferences.
Not randomizing order in pairwise comparisons: LLM judges often favor the first response shown. Randomize the order for human raters, and run both orderings for LLM judges.
Evaluating on too few examples: Fifty examples is not enough for reliable evaluation. Aim for 200+ for automated metrics and 100+ for human evaluation.
Ignoring edge cases in evaluation: Your evaluation set should include adversarial inputs, ambiguous queries, and out-of-scope requests, not just happy-path examples.
Treating evaluation as a one-time activity: LLM quality can change with model updates, prompt changes, or data shifts. Evaluate continuously.
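
The sample-size pitfall above can be made concrete with a margin-of-error calculation. The sketch below uses the normal approximation for a proportion, which assumes independent examples and becomes less accurate when the pass rate is close to 0 or 1. The function name `margin_of_error` is introduced here for illustration.

```python
import math


def margin_of_error(pass_rate, n, z=1.96):
    """Half-width of the normal-approximation 95% interval for a proportion."""
    return z * math.sqrt(pass_rate * (1 - pass_rate) / n)


# How the uncertainty shrinks as the evaluation set grows,
# for an observed pass rate of 80%
for n in (50, 100, 200, 500):
    print(f"n={n}: +/- {margin_of_error(0.8, n):.3f}")
```

At an observed 80% pass rate, 50 examples give a margin of about 11 percentage points either way, while 200 examples narrow it to roughly 5.5 points. A regression smaller than the margin cannot be distinguished from noise.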

Key Takeaways

LLM evaluation requires multiple approaches: automated metrics for speed, LLM-as-judge for scale, human evaluation for ground truth.
Traditional NLP metrics (BLEU, ROUGE) are mostly useless for open-ended generation. Use them only for translation and summarization.
LLM-as-judge is powerful but biased. Mitigate position bias by swapping order. Never use the same model as both generator and judge.
Task-specific evaluation (factuality, format compliance, safety) is more valuable than general quality scores.
Build a layered evaluation pipeline: cheap automated checks on every change, LLM-based evaluation on major changes, human evaluation before releases.
Human evaluation remains the gold standard. Budget for it and use it to calibrate your automated approaches.
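
One way to calibrate an LLM judge against human ratings is to have both label the same examples and measure their agreement. The sketch below computes Cohen's kappa, a standard chance-corrected agreement statistic, using only the standard library. The labels are made up for illustration, and what counts as acceptable agreement is a judgment call that depends on the task.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Agreement between two raters, corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both raters pick the same label
    # if each labeled independently at their own base rates
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # Both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Illustrative labels only (not real data)
human = ["good", "good", "bad", "good", "bad", "bad"]
judge = ["good", "good", "bad", "bad", "bad", "good"]
kappa = cohens_kappa(human, judge)
```

Here the judge matches the human on 4 of 6 items, but half of that agreement is expected by chance, so kappa comes out at about 0.33. Raw agreement alone would overstate how well the judge tracks human judgment.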