On Preference Optimization and DPO

Introduction

Training with preference data has allowed large language models (LLMs) to be optimized for specific qualities such as trust, safety, and harmlessness. Preference optimization is the process of using this data to improve LLMs. The method is particularly useful for tuning a model to emphasize certain qualities, and for training scenarios where relative feedback is more practical than absolute labels, especially when there is no single correct answer.

For instance, in the task of summarization, it is straightforward to determine which summary is better from a given set. Similarly, in content generation, users often need to produce content that matches their specific style, wording, phrasing, and tone. This type of personalization can be achieved through preference alignment. In recommendation systems, LLMs can generate suggestions based on users' past preferences. By collecting preference data, these models can be trained to cater to individual tastes. Furthermore, LLMs can be fine-tuned for specific tasks to prioritize aspects such as sentiment and accuracy or to reduce hallucinations, enhancing user trust and safety.

However, preference optimization is less useful in scenarios where there is a definitive correct answer, such as mathematical operations like adding two numbers, although it can still be used to reinforce the correct response among a set of candidate answers.

Methods like Supervised Fine-Tuning (SFT) or Parameter-Efficient Fine-Tuning (PEFT) are employed to train LLMs to predict the next word in a sequence. Integrating preference alignment with techniques like Reinforcement Learning from Human Feedback (RLHF) or Direct Preference Optimization (DPO) helps incorporate delayed rewards into the model. This alignment enhances the stability of the model and reduces noise in probability distributions. When multiple responses are sampled from LLMs and feedback is provided based on preference, the models learn to decrease the likelihood of undesirable words and increase the likelihood of preferable ones.[1]

DPO (Direct Preference Optimization)

DPO is a popular method for performing preference optimization on LLMs. Assume that for a prompt $x$ you generate two responses, $w$ and $l$, where $w$ is better than $l$. DPO optimizes the policy's scores relative to the reference model's scores using a ranking loss. The gradient of the loss below flows back into the policy model.

Ranking loss:

$$\mathcal{L}_{\text{ranking}} = -\log \sigma\left(\underbrace{\beta \log \frac{P_{\text{policy}}(w \mid x)}{P_{\text{ref}}(w \mid x)}}_{\text{win score relative to reference}} \;-\; \underbrace{\beta \log \frac{P_{\text{policy}}(l \mid x)}{P_{\text{ref}}(l \mid x)}}_{\text{loss score relative to reference}}\right)$$

where $\sigma$ is the sigmoid function and $\beta$ controls how far the policy is allowed to move from the reference.
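
Here is a minimal PyTorch sketch of this loss, assuming you already have the summed per-token log-probabilities of each response under the policy and the frozen reference model; the function and variable names are illustrative, not from a specific library.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l,   # log P_policy(w|x), log P_policy(l|x)
             ref_logp_w, ref_logp_l,         # log P_ref(w|x),    log P_ref(l|x)
             beta: float = 0.1):
    """Ranking (DPO) loss from summed per-token log-probs of each response."""
    # Win and loss scores relative to the frozen reference model.
    win_score = beta * (policy_logp_w - ref_logp_w)
    loss_score = beta * (policy_logp_l - ref_logp_l)
    # -log sigmoid(win_score - loss_score); gradients flow only into the policy.
    return -F.logsigmoid(win_score - loss_score).mean()

# Toy usage: log-probs of whole responses (sums over tokens), batch of 2.
policy_logp_w = torch.tensor([-5.9, -7.1], requires_grad=True)
policy_logp_l = torch.tensor([-6.2, -6.8], requires_grad=True)
ref_logp_w = torch.tensor([-6.2, -7.4])
ref_logp_l = torch.tensor([-6.0, -6.5])
print(dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l))
```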

  • How does it solve for delayed reward? The loss increases the probability of generating the win response over the loss response. In the process, it raises the probability of the important tokens (from the win response) and lowers the probability of the less important tokens (from the loss response), which addresses the delayed-reward problem.
  • How does DPO work? DPO is built on a ranking loss derived from the Bradley–Terry model, which tells the model to increase the probability of the policy's win response relative to the reference model's win response. For example, suppose $P_{\text{ref}}(w|x) = 0.002$. Since the policy is initialized from the reference, the loss pushes the policy to raise $P_{\text{policy}}(w|x)$ above that value. The same is done for the loss case, i.e. the probability of the policy's loss response is reduced relative to the reference's loss response (see the numeric sketch after this list).
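
To make the direction of the update concrete, here is a small numeric sketch with $\beta = 1$; the loss-side probability (0.001) and the updated policy probabilities are assumed values for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

beta = 1.0
p_ref_w, p_ref_l = 0.002, 0.001          # reference probabilities (loss-side value assumed)

# At initialization the policy matches the reference, so both margins are zero:
margin = beta * (math.log(0.002 / p_ref_w) - math.log(0.001 / p_ref_l))
print(-math.log(sigmoid(margin)))        # 0.693 = -log(0.5)

# Raising P_policy(w|x) and lowering P_policy(l|x) relative to the reference shrinks the loss:
margin = beta * (math.log(0.004 / p_ref_w) - math.log(0.0005 / p_ref_l))
print(-math.log(sigmoid(margin)))        # ~0.223
```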

[Figure: On the left, an LLM generates the response "I WANT BREAK FREE <EOS>" one token at a time, conditioning on the previous tokens, with next-word probabilities 0.25, 0.21, 0.55, 0.47, and 0.2; the generation probability of the whole response is their product, GenProb = 0.0027 (LogProb = -2.566). On the right, a policy model (LLM policy) initialized from a frozen reference model (LLM ref) scores the win and loss responses for each prompt, and DPO pushes GenProb_P W > GenProb_Ref W while GenProb_P L < GenProb_Ref L.]
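
The per-token numbers in the figure chain together as follows; a minimal sketch of how a response's generation probability and log-probability are computed from the next-token probabilities shown above (the figure reports the base-10 log).

```python
import math

# Next-token probabilities for "I WANT BREAK FREE <EOS>" from the figure above.
token_probs = [0.25, 0.21, 0.55, 0.47, 0.2]

# Generation probability of the whole response is the product of token probabilities.
gen_prob = math.prod(token_probs)

# In practice we work with summed log-probabilities to avoid numerical underflow.
log_prob = sum(math.log10(p) for p in token_probs)

print(f"GenProb = {gen_prob:.4f}")   # GenProb = 0.0027
print(f"LogProb = {log_prob:.3f}")   # LogProb = -2.566
```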

  • Why do we need reference scores? Won't they cancel out during backprop? Reference scores are important for scaling. Imagine that W has 10 tokens and L has 20 tokens: there will be an inherent difference in the scale of the log-probabilities of W and L. The reference model's scores therefore let you optimize relative to the reference rather than compare raw log-probabilities directly (see the sketch below).
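
A small illustration of the scale issue, using assumed per-token probabilities: a 20-token response has a far lower raw log-probability than a 10-token one simply because of its length, but subtracting the reference score puts both responses on a comparable scale.

```python
import math

# Assumed per-token probabilities for illustration.
logp_w_policy = 10 * math.log(0.3)   # 10-token win response  -> ~ -12.0
logp_l_policy = 20 * math.log(0.3)   # 20-token loss response -> ~ -24.1

# Raw log-probs differ mostly because of length, not because of preference.
print(logp_w_policy, logp_l_policy)

# Scoring each response relative to the reference removes this length effect.
logp_w_ref = 10 * math.log(0.28)
logp_l_ref = 20 * math.log(0.31)
win_score = logp_w_policy - logp_w_ref    # ~ +0.69
loss_score = logp_l_policy - logp_l_ref   # ~ -0.66
print(win_score, loss_score)
```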

What Can Be Labels?

Generally, DPO can be performed with any sort of preference data. Preferences over responses can also be generated indirectly by machines; any signal that can be transformed into preference data will work (a small conversion sketch follows the list below). For example:

  • comparison of generated outputs: summary A > summary B
  • binary classification turned into a comparison: code that compiles > code that does not compile
  • regression / mean squared error turned into a comparison: smaller error to the expected value > larger error
  • parse signals: parsable output > non-parsable output
  • format signals: output in the expected format > output not in the expected format
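
A minimal sketch of turning such signals into (prompt, chosen, rejected) pairs; the helper name and the scoring convention are illustrative assumptions, not part of any library.

```python
from itertools import combinations

def pairs_from_scores(prompt, responses, scores, higher_is_better=True):
    """Turn any scalar signal (compile success, negative error, parse success, ...)
    into DPO-style preference pairs for one prompt."""
    pairs = []
    for (r_a, s_a), (r_b, s_b) in combinations(zip(responses, scores), 2):
        if s_a == s_b:
            continue  # no preference signal between equally scored responses
        better, worse = (r_a, r_b) if (s_a > s_b) == higher_is_better else (r_b, r_a)
        pairs.append({"prompt": prompt, "chosen": better, "rejected": worse})
    return pairs

# Binary signal: code that compiles (1) is preferred over code that does not (0).
print(pairs_from_scores("write a sort function",
                        ["def sort_a(xs): ...", "def sort_b(xs: ..."], [1, 0]))

# Regression signal: smaller error to the expected value (42) is preferred.
print(pairs_from_scores("predict the total", ["41", "58"],
                        [-abs(41 - 42), -abs(58 - 42)]))
```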

Curating a Comparison Dataset

A comparison dataset is built from responses sampled from various LLMs as well as human-written responses. One can generate a large number of samples from an LLM for the same prompt at high temperature, which produces responses with different variations; Llama 2, for instance, sampled responses from models of different sizes. The preference labels for these responses can be generated by machines or by humans, depending on the task. This comparison feedback is then used to train the model via DPO. Since the samples are collected at high temperature, this is probably why models become more stable after human alignment: much of that sampling noise has been removed.
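
A minimal sketch of sampling several candidate responses for one prompt at high temperature with Hugging Face transformers; the model name, prompt, and generation settings are assumptions for illustration, not a recommendation.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model; substitute whatever base model you are aligning.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Summarize: The meeting covered the quarterly budget and hiring plans."
inputs = tokenizer(prompt, return_tensors="pt")

# High temperature + sampling yields varied candidates for the same prompt.
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=1.2,
    top_p=0.95,
    num_return_sequences=8,
    max_new_tokens=64,
    pad_token_id=tokenizer.eos_token_id,
)
candidates = tokenizer.batch_decode(
    outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
# These candidates are then labeled (by humans or machines) into win/loss pairs.
print(candidates)
```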

References

  1. https://chat.openai.com/share/7455bd78-6627-455b-a59e-5bc974836e8e
