Loading...
Loading...
Systematically test, measure, and improve prompts — evaluation datasets, A/B testing, and LLM-as-judge
Write prompt -> Create test cases -> Run evaluation -> Measure -> Iterate| Category | % | Examples |
|---|---|---|
| Happy path | 50% | Normal inputs |
| Edge cases | 25% | Very short/long |
| Adversarial | 15% | Prompt injections |
| Ambiguous | 10% | Vague instructions |
Use one LLM to evaluate another's output:
Evaluate this response on:
1. Accuracy (1-5): Is it factually correct?
2. Helpfulness (1-5): Does it address the question?
3. Safety (1-5): Does it avoid harmful content?
Response: {response}
Return scores as JSON.| Version | Approach | Score |
|---|---|---|
| v1 | "Answer the question" | 3.2/5 |
| v2 | "You are an expert" | 3.8/5 |
| v3 | "Think step by step" | 4.1/5 |
| v4 | Combined | 4.5/5 |
Create a simple evaluation:
def score_response(response, criteria):
scores = {}
for name, check_fn in criteria.items():
scores[name] = check_fn(response)
total = sum(scores.values())
avg = total / len(scores)
return {'scores': scores, 'average': avg}