LLM-as-a-Judge for AI Systems

Introduction

  1. Why evaluation?: Evaluating AI systems is costly and time-consuming, yet evaluation is essential for identifying gaps in current systems. Consider a search system: for a given query, you need to identify the right answer from a set of millions of documents. Similarly, consider an LLM response: for a given prompt and response, how long would it take you to check whether the response is apt for the prompt (factuality, relevance, groundedness)?
  2. How was it done before?: Traditionally, these labels were annotated by humans at large scale. Amazon’s Mechanical Turk and a number of startups have emerged to provide large-scale human annotations. Collecting human annotations at this scale has its own problems, from annotation quality, noisy raters, and noisy labels to cost and time. You can read more about collecting quality human labels here.
  3. How can LLMs help?: With recent advances, LLMs reach near-human quality in multiple domains and can be used in place of human annotators to collect large-scale annotations, solving the time and cost problems of evaluating ML systems. A good analogue is RLHF training: we train a ranking model on human-collected ratings, then use that fine-tuned ranking model to provide real-time feedback on new <prompt, response> pairs generated by the model.
  4. Challenges with LLM-as-a-Judge: This new trend is helping engineers collect large-scale labels for their ML systems, but judge models often suffer from biases that creep in from their training process, for example verbosity bias, limited reasoning bias, and knowledge-cutoff bias.

Common Patterns of LLM-as-a-Judge

  1. The specific method for using an LLM as a judge depends on the task. Generally, there are two common approaches: (a) pairwise judgements and (b) standalone judgements. Comparing multiple LLM responses is generally done with pairwise judgements, while evaluating a {query, item} pair for relevance can be done with standalone judgements.
  2. Pairwise judgements compare two variants of the output. Imagine a prompt and two responses sampled from two different LLMs: the judge model is given the prompt and both responses and is asked which one is better and why. This type of judgement is useful for complex problems with no single right answer, such as summarization, chat responses, or human alignment.
  3. Standalone judgements ask the judge model to rate a single response from a single model on an absolute scale. Taking the previous example, the judge model is given the prompt and one response and asked to rate it on a scale of 1-5, or to classify the response along a specific factor like helpfulness or factuality.
  4. Tradeoff: Pairwise judgements do not scale well for some problems, since the number of pairs grows quadratically with the number of responses sampled. Standalone judgements, on the other hand, lack stability and can fluctuate across runs. Minimal prompt templates for both styles are sketched below.
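
Below are illustrative templates for both styles; the wording and helper names are assumptions for the sketch, not taken from the post or any specific library.

```python
# Hypothetical prompt templates for pairwise and standalone judgements.
PAIRWISE_TEMPLATE = """You are an impartial judge. Given a prompt and two responses
(A and B), decide which response is better and explain why.

Prompt: {prompt}
Response A: {response_a}
Response B: {response_b}

Answer in YAML with keys "winner" (A or B) and "reason"."""

STANDALONE_TEMPLATE = """You are an impartial judge. Rate the response to the prompt
on a scale of 1-5 for helpfulness and factuality.

Prompt: {prompt}
Response: {response}

Answer in YAML with keys "rating" and "reason"."""


def build_pairwise_prompt(prompt: str, response_a: str, response_b: str) -> str:
    """Fill the pairwise template for one comparison."""
    return PAIRWISE_TEMPLATE.format(
        prompt=prompt, response_a=response_a, response_b=response_b
    )


def build_standalone_prompt(prompt: str, response: str) -> str:
    """Fill the standalone template for one response."""
    return STANDALONE_TEMPLATE.format(prompt=prompt, response=response)
```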

Method

Basic

  1. Problem Identification: Imagine a task like search retrieval. Traditionally, evaluating retrieval models is hard because it requires collecting a relevance signal for each {query, item} pair retrieved by the system. For each query, the system retrieves items on the scale of 100-100K, and getting human annotations at this scale for a single query is not feasible, which makes it a perfect case for LLM-as-a-judge.
  2. Model Selection: Use models like GPT-4, Claude 3, and Gemini 1.5 to perform judgements. These are big models tuned for performance on a wide variety of tasks, and thus more likely to perform well as judges. I keep the temperature and top_p low because we want the judge model to generate stable, non-random answers with little creativity.
  3. Data Collection and Inference: Collect the data to be evaluated and write a prompt template for the judge model. A sample prompt for the retrieval problem can be found below. Next, use the template to run inference on the dataset with the judge model (see the sketch after this list).
You are an expert search engine optimizer. You will be given a query and a document. You need to provide a judgement on the relevance level of the document for a user searching for the query. You can provide the following judgements:
- Highly Relevant: The document is exactly what the user is looking for
- Relevant: The document contains common topics that the user is looking for
- Not Relevant: The document is not at all relevant to what the user is looking for

Please provide your assessment for the following in YAML format with key "judgement"
Query: {query}
Document: {document}
  4. Parsing Output: Post-process and parse the response into a structured format. The judge model doesn’t always follow the instructions tightly and may generate noisy text around the actual output, which makes parsing difficult. I generally ask the model to generate output in YAML format, which helps with parsing even when extra text is present; use a regex to handle such cases.
  5. Transform the model outputs into the metrics that will be consumed by downstream use cases like monitoring/evaluating the system, finding failure cases, and data preparation.
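
A minimal sketch of inference and parsing for the retrieval judge, assuming the OpenAI Python client (openai >= 1.0) and PyYAML. The model name, shortened template, and helper names are illustrative; the full prompt from above can be substituted.

```python
import re
from typing import Optional

import yaml
from openai import OpenAI

# Shortened stand-in for the relevance prompt shown above.
JUDGE_TEMPLATE = """You are an expert search engine optimizer. Judge the relevance of the
document for the query as one of: Highly Relevant, Relevant, Not Relevant.
Answer in YAML format with key "judgement".
Query: {query}
Document: {document}"""

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def judge_relevance(query: str, document: str, model: str = "gpt-4") -> Optional[str]:
    """Call the judge model with low temperature/top_p and parse its judgement."""
    prompt = JUDGE_TEMPLATE.format(query=query, document=document)
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # low temperature/top_p for stable, non-creative judgements
        top_p=0.1,
    )
    return parse_judgement(response.choices[0].message.content)


def parse_judgement(text: str) -> Optional[str]:
    """Parse the YAML output; fall back to a regex when extra text surrounds it."""
    try:
        parsed = yaml.safe_load(text)
        if isinstance(parsed, dict) and "judgement" in parsed:
            return str(parsed["judgement"]).strip()
    except yaml.YAMLError:
        pass
    # Fallback: grab whatever follows "judgement:" anywhere in the text.
    match = re.search(r"judgement:\s*(.+)", text, flags=re.IGNORECASE)
    return match.group(1).strip() if match else None
```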

Evaluating Judge Model

  1. The judgement system itself should also be evaluated on the task using a small dataset. Build an expert (human) judgement dataset of O(100-1K) examples and evaluate the judge model against it. The quality of this dataset should be high, since it will be used to tune the judge model and its knowledge will be cascaded into the whole system. Make sure agreement between your LLM judge and the expert judges is > 0.8.
  2. An example of an agreement metric is MAE (mean absolute error) between labels mapped as “Not Relevant” → 0, “Relevant” → 1, “Highly Relevant” → 2. Another example of agreement is accuracy between classification labels. Both are sketched below.
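
A small sketch of both agreement metrics between expert and LLM labels. The label mapping follows the text; the function name is illustrative.

```python
# Agreement between expert (human) labels and LLM judge labels.
LABEL_TO_SCORE = {"Not Relevant": 0, "Relevant": 1, "Highly Relevant": 2}


def agreement_metrics(expert_labels: list[str], llm_labels: list[str]) -> dict[str, float]:
    """Compute MAE over mapped scores and raw label accuracy."""
    assert len(expert_labels) == len(llm_labels) > 0
    expert = [LABEL_TO_SCORE[label] for label in expert_labels]
    llm = [LABEL_TO_SCORE[label] for label in llm_labels]
    mae = sum(abs(e - j) for e, j in zip(expert, llm)) / len(expert)
    accuracy = sum(e == j for e, j in zip(expert_labels, llm_labels)) / len(expert_labels)
    return {"mae": mae, "accuracy": accuracy}


# Example: agreement_metrics(["Relevant", "Not Relevant"], ["Highly Relevant", "Not Relevant"])
# -> {"mae": 0.5, "accuracy": 0.5}
```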

Improving Judge Performance

  1. Try a few-shot prompt if a zero-shot prompt is not working well. IMHO, adding up to 4 examples generally improves the model’s accuracy on the task, though it comes at a higher cost since more tokens are added.
  2. Try chain-of-thought. Ask the model to first generate the reason and then the final judgement. I generally ask the model to generate both reason and judgement as YAML, and to keep the reason short, i.e. under 20 words.
  3. Try reference-guided generation. Zheng et al. observed high error rates even with CoT prompts, so they used reference-guided generation, where the judge model is also given an answer it generated itself for the given prompt, passed in as a “reference answer”. This can be used for judging generation tasks; I am not sure about other tasks.
  4. Try methods that generally improve LLM performance: Reflexion, Tree of Thoughts, agentic prompting.
  5. LLMs have been shown to exhibit position bias in pairwise judgements. To handle it, you can randomly swap the answers, or only accept a judgement if it remains consistent after swapping (see the sketch below).
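
A minimal sketch of the position-swap consistency check. judge_pair is a hypothetical callable that returns "A" or "B" for whichever response it prefers; only the swap-and-compare logic is the point here.

```python
from typing import Callable, Optional


def consistent_pairwise_judgement(
    prompt: str,
    response_1: str,
    response_2: str,
    judge_pair: Callable[[str, str, str], str],
) -> Optional[str]:
    """Judge twice with the responses in both orders; keep only consistent verdicts."""
    first = judge_pair(prompt, response_1, response_2)   # response_1 shown as "A"
    second = judge_pair(prompt, response_2, response_1)  # response_1 shown as "B"

    # Map both verdicts back to the underlying responses.
    winner_first = response_1 if first == "A" else response_2
    winner_second = response_2 if second == "A" else response_1

    # Accept the winner only if it survives the swap; otherwise treat it as a tie/abstain.
    return winner_first if winner_first == winner_second else None
```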

Scaling Judgements

  1. For use cases like continuous monitoring and dataset preparation, you need to generate these judgements at large scale. Using models like GPT-4 can still be costly when judging millions of samples.
  2. Sampling Multiple Responses: Sample judgements from multiple smaller models with multiple configurations, and only trigger a large model if those results are ambiguous. This saves cost at the price of added latency. Ensemble methods can be used to combine the judgements from multiple smaller models into a final judgement; a cascade sketch follows this list. Again, EVALUATE.
  3. Fine-tuning a smaller model (e.g. 8B or 13B) on expert judgements can match the performance of big models. Zheng et al. fine-tuned a Vicuna-13B model to use as a cheap proxy for GPT-4, and the fine-tuned Vicuna’s performance was competitive.
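
A minimal sketch of the cost-saving cascade: several small judge models vote, and a large model is called only when they disagree. The judge callables and the agreement threshold are hypothetical.

```python
from collections import Counter
from typing import Callable, Sequence


def cascade_judgement(
    sample: dict,
    small_judges: Sequence[Callable[[dict], str]],
    large_judge: Callable[[dict], str],
    min_agreement: float = 1.0,
) -> str:
    """Return the small models' majority label if they agree enough, else escalate."""
    votes = [judge(sample) for judge in small_judges]
    label, count = Counter(votes).most_common(1)[0]
    if count / len(votes) >= min_agreement:
        return label            # small judges agree: cheap path
    return large_judge(sample)  # ambiguous: fall back to the expensive model
```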

Closing

  1. LLM-as-a-Judge is a new trend that replaces the otherwise costly and time-consuming process of human evaluation. It is far from perfect, but research suggests that near-human performance can be expected from the current generation of models if they are used correctly. Judgements generated this way require periodic checks on their sanity and quality. Recently, the generative capabilities of LLMs have also led to related methods such as synthetic data generation for solving the data-crunch problem in search and AI systems. While this trend is growing quickly in NLP tasks, its applications in other domains like image and audio can be explored further.

References

  1. Zheng et al., “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena”, NeurIPS 2023.