From supervised fine-tuning to preference alignment — teaching models to be helpful and harmless
SFT (supervised fine-tuning): train on (prompt, ideal_response) pairs. Teaches the model to follow instructions.
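A minimal sketch of what an SFT example looks like and how the loss is commonly computed, assuming the field names below and the convention of masking prompt tokens so only the response is learned (both are illustrative, not a specific library's API):

```python
import torch.nn.functional as F

# SFT data example (field names are illustrative)
sft_example = {
    'prompt': 'Write a professional email',
    'ideal_response': 'Dear [Name], Thank you for the invitation...'
}

def sft_loss(logits, input_ids, prompt_len):
    """Cross-entropy over the response tokens only; prompt tokens are masked."""
    labels = input_ids.clone()
    labels[:, :prompt_len] = -100            # -100 is ignored by cross_entropy
    shift_logits = logits[:, :-1, :]          # token t predicts token t+1
    shift_labels = labels[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```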
DPO (direct preference optimization): train on pairs of (chosen, rejected) responses to the same prompt. The model learns to prefer the chosen responses over the rejected ones.
```python
# DPO data example
{
    'prompt': 'Write a professional email',
    'chosen': 'Dear [Name], Thank you for the invitation...',
    'rejected': 'Sorry, can not make it.'
}
```

Why DPO over RLHF? No reward model is needed, so training is simpler, cheaper, and more stable.
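Under the hood, DPO optimizes a logistic loss on the gap between the "implicit rewards" of the chosen and rejected responses. A minimal sketch, assuming you already have the summed log-probabilities of each response under the policy being trained and under a frozen reference model (beta is the usual strength hyperparameter):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Implicit reward = beta * (policy log-prob minus reference log-prob)
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Logistic loss pushes the chosen response above the rejected one
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```

In practice, libraries such as TRL wrap this objective in a trainer so you only supply the (prompt, chosen, rejected) dataset.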
RLHF (reinforcement learning from human feedback): used for frontier models. Requires training a separate reward model. More complex, but potentially more powerful.
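For contrast, here is a hypothetical example of the preference data used to train the reward model in RLHF: several responses to one prompt, ranked by annotators (field names and responses are illustrative):

```python
# Hypothetical reward-model training example for RLHF
{
    'prompt': 'Write a professional email',
    'responses': [
        'Dear [Name], Thank you for the invitation...',  # best
        'Thanks for the invite, see you there.',
        'Sorry, can not make it.',                        # worst
    ],
    'rankings': [1, 2, 3],  # 1 = best
}
```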
| Aspect | SFT | DPO | RLHF |
|---|---|---|---|
| Data needed | (prompt, response) | (prompt, chosen, rejected) | (prompt, responses + rankings) |
| Complexity | Low | Medium | High |
| Reward model | No | No | Yes |
| Stability | High | Medium-High | Low |
Design a preference pair for DPO training on the prompt: "Explain machine learning to a 10-year-old."