Measure model quality — standard benchmarks, custom evaluation datasets, and LLM-as-judge
| Benchmark | Measures |
|---|---|
| MMLU | Knowledge across 57 subjects |
| GSM8K | Math word problems |
| HumanEval | Python code generation |
| MT-Bench | Multi-turn conversation quality |
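These benchmarks are typically run through an existing evaluation harness rather than scored by hand. A minimal sketch, assuming EleutherAI's lm-evaluation-harness (`pip install lm-eval`) and its `simple_evaluate` helper; the model name, task list, and batch size are placeholders, and the exact API can differ between harness versions:

```python
import lm_eval  # EleutherAI lm-evaluation-harness (assumed API; check your installed version)

# Score a Hugging Face model on two of the benchmarks above.
results = lm_eval.simple_evaluate(
    model="hf",                                                 # transformers backend
    model_args="pretrained=meta-llama/Llama-3.1-8B-Instruct",   # placeholder model
    tasks=["mmlu", "gsm8k"],                                    # benchmark task names
    batch_size=8,
)

# Per-task metrics (accuracy, exact match, etc.) live under results["results"].
for task, metrics in results["results"].items():
    print(task, metrics)
```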
Build a test set from YOUR production data:
```python
test_set = [
    {
        'prompt': 'What is our refund policy?',
        'expected_contains': ['30 days'],
        'expected_avoid': ['no refunds'],
    },
]
```
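Scoring the set is then a matter of running each prompt through the model and applying simple string checks. A sketch with a hypothetical `generate()` standing in for whatever client or pipeline you call the model with:

```python
def generate(prompt: str) -> str:
    """Placeholder: call your model here (API client, local pipeline, etc.)."""
    raise NotImplementedError

def run_test_set(test_set: list[dict]) -> None:
    passed = 0
    for case in test_set:
        response = generate(case['prompt']).lower()
        # Pass only if every required phrase appears and no forbidden phrase does.
        has_required = all(s.lower() in response for s in case['expected_contains'])
        has_forbidden = any(s.lower() in response for s in case['expected_avoid'])
        if has_required and not has_forbidden:
            passed += 1
        else:
            print(f"FAIL: {case['prompt']}")
    print(f"{passed}/{len(test_set)} cases passed")

run_test_set(test_set)
```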
Use one model to evaluate another (LLM-as-judge):
```
Evaluate this response on:
1. Accuracy (1-5): Factually correct?
2. Helpfulness (1-5): Addresses the query?
3. Safety (1-5): Avoids harmful content?
Return as JSON.
```
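One way to wire this up, sketched with the OpenAI Python client as the judge (the judge model, the appended query/response fields, and the JSON response format are assumptions; any capable chat model can fill the judge role):

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """Evaluate this response on:
1. Accuracy (1-5): Factually correct?
2. Helpfulness (1-5): Addresses the query?
3. Safety (1-5): Avoids harmful content?
Return as JSON.

Query: {query}
Response: {response}"""

def judge(query: str, response: str) -> dict:
    completion = client.chat.completions.create(
        model="gpt-4o",  # placeholder judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(query=query, response=response)}],
        response_format={"type": "json_object"},  # ask for parseable JSON scores
    )
    return json.loads(completion.choices[0].message.content)

print(judge("What is our refund policy?", "Refunds are available within 30 days of purchase."))
```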
Example results comparing a prompted baseline to a LoRA fine-tune:

| Model | Accuracy | Style |
|---|---|---|
| Baseline (GPT-4o + prompt) | 78% | 3.5/5 |
| After LoRA | 92% | 4.7/5 |
| Improvement | +14 pts | +1.2 pts |