From supervised fine-tuning to preference alignment — teaching models to be helpful and harmless
SFT (supervised fine-tuning): train on (prompt, ideal_response) pairs. Teaches the model to follow instructions.
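A minimal sketch of what an SFT example looks like and how the loss is commonly computed, assuming the field names below and the convention of masking prompt tokens so only the response is learned (both are illustrative, not a specific library's API):

```python
import torch.nn.functional as F

# SFT data example (field names are illustrative)
sft_example = {
    'prompt': 'Write a professional email',
    'ideal_response': 'Dear [Name], Thank you for the invitation...'
}

def sft_loss(logits, input_ids, prompt_len):
    """Cross-entropy over the response tokens only; prompt tokens are masked."""
    labels = input_ids.clone()
    labels[:, :prompt_len] = -100            # -100 is ignored by cross_entropy
    shift_logits = logits[:, :-1, :]          # token t predicts token t+1
    shift_labels = labels[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```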
DPO (direct preference optimization): train on pairs of (chosen, rejected) responses to the same prompt. The model learns to prefer the chosen responses over the rejected ones.
```python
# DPO data example
{
    'prompt': 'Write a professional email',
    'chosen': 'Dear [Name], Thank you for the invitation...',
    'rejected': 'Sorry, can not make it.'
}
```

Why DPO over RLHF? No reward model is needed, so training is simpler, cheaper, and more stable.
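Under the hood, DPO optimizes a logistic loss on the gap between the "implicit rewards" of the chosen and rejected responses. A minimal sketch, assuming you already have the summed log-probabilities of each response under the policy being trained and under a frozen reference model (beta is the usual strength hyperparameter):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Implicit reward = beta * (policy log-prob minus reference log-prob)
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Logistic loss pushes the chosen response above the rejected one
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```

In practice, libraries such as TRL wrap this objective in a trainer so you only supply the (prompt, chosen, rejected) dataset.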
RLHF (reinforcement learning from human feedback): used for frontier models. Requires training a separate reward model. More complex, but potentially more powerful.
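For contrast, here is a hypothetical example of the preference data used to train the reward model in RLHF: several responses to one prompt, ranked by annotators (field names and responses are illustrative):

```python
# Hypothetical reward-model training example for RLHF
{
    'prompt': 'Write a professional email',
    'responses': [
        'Dear [Name], Thank you for the invitation...',  # best
        'Thanks for the invite, see you there.',
        'Sorry, can not make it.',                        # worst
    ],
    'rankings': [1, 2, 3],  # 1 = best
}
```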
| Aspect | SFT | DPO | RLHF |
|---|---|---|---|
| Data needed | (prompt, response) | (prompt, chosen, rejected) | (prompt, responses + rankings) |
| Complexity | Low | Medium | High |
| Reward model | No | No | Yes |
| Stability | High | Medium-High | Low |
Design a preference pair for DPO training on the prompt: "Explain machine learning to a 10-year-old."