Measure model quality — standard benchmarks, custom evaluation datasets, and LLM-as-judge
| Benchmark | Measures |
|---|---|
| MMLU | Knowledge across 57 subjects |
| GSM8K | Math word problems |
| HumanEval | Python code generation |
| MT-Bench | Multi-turn conversation quality |
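These benchmarks are typically run through an existing evaluation harness rather than scored by hand. A minimal sketch, assuming EleutherAI's lm-evaluation-harness (`pip install lm-eval`) and its `simple_evaluate` helper; the model name, task list, and batch size are placeholders, and the exact API can differ between harness versions:

```python
import lm_eval  # EleutherAI lm-evaluation-harness (assumed API; check your installed version)

# Score a Hugging Face model on two of the benchmarks above.
results = lm_eval.simple_evaluate(
    model="hf",                                                 # transformers backend
    model_args="pretrained=meta-llama/Llama-3.1-8B-Instruct",   # placeholder model
    tasks=["mmlu", "gsm8k"],                                    # benchmark task names
    batch_size=8,
)

# Per-task metrics (accuracy, exact match, etc.) live under results["results"].
for task, metrics in results["results"].items():
    print(task, metrics)
```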
Build a test set from YOUR production data:
```python
test_set = [
    {
        'prompt': 'What is our refund policy?',
        'expected_contains': ['30 days'],
        'expected_avoid': ['no refunds'],
    },
]
```
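Scoring the set is then a matter of running each prompt through the model and applying simple string checks. A sketch with a hypothetical `generate()` standing in for whatever client or pipeline you call the model with:

```python
def generate(prompt: str) -> str:
    """Placeholder: call your model here (API client, local pipeline, etc.)."""
    raise NotImplementedError

def run_test_set(test_set: list[dict]) -> None:
    passed = 0
    for case in test_set:
        response = generate(case['prompt']).lower()
        # Pass only if every required phrase appears and no forbidden phrase does.
        has_required = all(s.lower() in response for s in case['expected_contains'])
        has_forbidden = any(s.lower() in response for s in case['expected_avoid'])
        if has_required and not has_forbidden:
            passed += 1
        else:
            print(f"FAIL: {case['prompt']}")
    print(f"{passed}/{len(test_set)} cases passed")

run_test_set(test_set)
```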
Use one model to evaluate another (LLM-as-judge):
```
Evaluate this response on:
1. Accuracy (1-5): Factually correct?
2. Helpfulness (1-5): Addresses the query?
3. Safety (1-5): Avoids harmful content?
Return as JSON.
```
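One way to wire this up, sketched with the OpenAI Python client as the judge (the judge model, the appended query/response fields, and the JSON response format are assumptions; any capable chat model can fill the judge role):

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """Evaluate this response on:
1. Accuracy (1-5): Factually correct?
2. Helpfulness (1-5): Addresses the query?
3. Safety (1-5): Avoids harmful content?
Return as JSON.

Query: {query}
Response: {response}"""

def judge(query: str, response: str) -> dict:
    completion = client.chat.completions.create(
        model="gpt-4o",  # placeholder judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(query=query, response=response)}],
        response_format={"type": "json_object"},  # ask for parseable JSON scores
    )
    return json.loads(completion.choices[0].message.content)

print(judge("What is our refund policy?", "Refunds are available within 30 days of purchase."))
```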
Example results comparing a prompted baseline to a LoRA fine-tune:

| Model | Accuracy | Style |
|---|---|---|
| Baseline (GPT-4o + prompt) | 78% | 3.5/5 |
| After LoRA | 92% | 4.7/5 |
| Improvement | +14 pts | +1.2 pts |