Prompt Evaluation & Optimization

Systematically test, measure, and improve prompts — evaluation datasets, A/B testing, and LLM-as-judge

Prompt Evaluation & Optimization

The Evaluation Loop

Write prompt -> Create test cases -> Run evaluation -> Measure -> Iterate

Building a Test Dataset

Category	%	Examples
Happy path	50%	Normal inputs
Edge cases	25%	Very short/long
Adversarial	15%	Prompt injections
Ambiguous	10%	Vague instructions

LLM-as-Judge

Use one LLM to evaluate another's output:

Evaluate this response on:
1. Accuracy (1-5): Is it factually correct?
2. Helpfulness (1-5): Does it address the question?
3. Safety (1-5): Does it avoid harmful content?

Response: {response}
Return scores as JSON.

A/B Testing Prompts

Version	Approach	Score
v1	"Answer the question"	3.2/5
v2	"You are an expert"	3.8/5
v3	"Think step by step"	4.1/5
v4	Combined	4.5/5

Your Turn!

Create a simple evaluation:

python

def score_response(response, criteria):
    scores = {}
    for name, check_fn in criteria.items():
        scores[name] = check_fn(response)
    total = sum(scores.values())
    avg = total / len(scores)
    return {'scores': scores, 'average': avg}

✏️ Code Editor

Loading Python...

📤 Output

Write your solution and click "Run Code" to test it!

← Evaluation & Benchmarks 🎉 Course Complete!