Proj. CLJ Paper Reading: A Survey on LLM-as-a-Judge
Abstract
- good words: subjectivity, variability, scale
- Task: Survey of LLM-as-a-Judge, benchmark & evaluation of LLM-as-a-Judge systems
- Core question: How can reliable LLM-as-a-Judge systems be built?
- Github: https://github.com/IDEA-FinAI/LLM-as-Evaluator
- Strategies:
- improving consistency
- mitigating biases
- adapting to diverse assessment scenarios
- The paper's writing is verbose and hard to follow
1. intro
Tasks of this paper:
- Explore strategies for making LLM-as-a-Judge reliable:
- improving consistency
- mitigating biases
- adapting to diverse assessment scenarios
- Evaluate the reliability of LLM-as-a-Judge systems themselves
2. Background and Method
-
Good sentences:
- LLM-as-a-Judge is an auto-regressive generative model
- In scenarios with sparse reward signals, such as a binary success status (success/fail), the self-reflection model uses the current trajectory and persistent memory to generate nuanced and specific feedback.
- LLM and human evaluations are more aligned in the context of pairwise comparisons compared to score-based assessments.
- Aligning with human judgement: The role of pairwise preference in large language model evaluators
- pairwise comparative assessments outperform other judging methods in terms of positional consistency
- LLM comparative assessment: Zero-shot NLG evaluation through pairwise comparisons using large language models
-
Defs:
- E ← P_LLM(x ⊕ C)
- E: The final evaluation obtained from the whole LLM-as-a-Judge process; it could be a score, a choice, a sentence, etc.
- P_LLM: The probability function defined by the corresponding LLM; generation is an auto-regressive process.
- x: The input data to be evaluated, in any available modality (text, image, video).
- C: The context for the input x, often a prompt template, possibly combined with dialogue history.
- ⊕: The combination operator joining the input x with the context C; its exact form varies, e.g., x may be placed at the beginning, middle, or end.
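A minimal sketch of this formulation in Python (not from the paper), assuming a hypothetical `call_llm` wrapper around the underlying model:

```python
# Sketch of E ← P_LLM(x ⊕ C). call_llm is a hypothetical stand-in
# for any chat-completion API; it is not an API named in the paper.
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # plug in a real LLM call here

def combine(x: str, context: str) -> str:
    """The ⊕ operator: here x is substituted into an '{x}' slot of the
    context template; beginning, middle, or end placements are all valid."""
    return context.format(x=x)

def judge(x: str, context: str) -> str:
    """E ← P_LLM(x ⊕ C): E is whatever the model generates
    auto-regressively, e.g., a score, a choice, or a sentence."""
    return call_llm(combine(x, context))
```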
-
Basic subtasks: In-Context Learning, Model Selection, Post-processing Method and Evaluation Pipeline
2.1 In-context learning
- the design of prompt
- generating scores
- solving true/false questions
- conducting pairwise comparisons
- making multiple-choice selections
- the design of input
- type of input variables (text, image, video, ...)
- position of input (beginning, middle, end)
- manner (individually, in pairs, in batches)
2.1.1 Generating scores
- scores:
- discrete or continuous
- range: 1-5, 0-100...
- main criteria
- overall score, helpfulness, relevance, accuracy, level of details
- LLM-as-an-Examiner
- Likert scale scoring functions
- First have the LLM score each predefined dimension separately, then give an overall score (see the sketch after the template below)
- predefined dimensions, e.g., accuracy, coherence, factuality and comprehensiveness
-
Evaluate the quality of summaries written for a news article. Rate each summary on four dimensions: {Dimension_1}, {Dimension_2}, {Dimension_3}, and {Dimension_4}. You should rate on a scale from 1 (worst) to 5 (best). Article: {Article} Summary: {Summary}
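A hedged sketch of the dimension-then-overall pattern using the template above; `call_llm`, the dimension names, and the reply-parsing format are assumptions:

```python
import re

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # stand-in for a real chat-completion call

TEMPLATE = (
    "Evaluate the quality of summaries written for a news article. "
    "Rate each summary on four dimensions: {d[0]}, {d[1]}, {d[2]}, and {d[3]}. "
    "You should rate on a scale from 1 (worst) to 5 (best).\n"
    "Article: {article}\nSummary: {summary}"
)

def score_summary(article: str, summary: str) -> dict:
    # Dimension names follow the paper's examples (accuracy, coherence,
    # factuality, comprehensiveness); any predefined set works.
    dims = ["accuracy", "coherence", "factuality", "comprehensiveness"]
    reply = call_llm(TEMPLATE.format(d=dims, article=article, summary=summary))
    scores = {}
    for dim in dims:
        # Assumes the judge answers with lines like "accuracy: 4".
        m = re.search(rf"{dim}\D*?([1-5])", reply, re.IGNORECASE)
        if m:
            scores[dim] = int(m.group(1))
    # Averaging is one simple aggregation; the variant in the notes asks
    # the LLM itself for a final overall score after the per-dimension ones.
    scores["overall"] = sum(scores.values()) / max(len(scores), 1)
    return scores
```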
2.1.2 Solving Yes/No questions
- Usually used at intermediate steps, e.g., inside a feedback loop
- self-optimization
- e.g., Reflexion (see the sketch below)
- Q: Very common in checks for alignment tasks
- Is it just an ordinary "is it aligned?" question, or is there something special about it?
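A minimal sketch of a yes/no judge inside a Reflexion-style feedback loop; `generate`, `judge_pass`, and the prompt wording are illustrative assumptions, not Reflexion's actual implementation:

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # stand-in for a real chat-completion call

def generate(task: str, feedback: str = "") -> str:
    # Actor model: produce an attempt, optionally conditioned on feedback.
    return call_llm(f"Task: {task}\nFeedback on last attempt: {feedback}\nAnswer:")

def judge_pass(task: str, attempt: str) -> bool:
    # The yes/no judge used as an intermediate step of the loop.
    reply = call_llm(f"Task: {task}\nAttempt: {attempt}\n"
                     "Does the attempt solve the task? Answer Yes or No.")
    return reply.strip().lower().startswith("yes")

def reflexion_loop(task: str, max_rounds: int = 3) -> str:
    attempt = generate(task)
    for _ in range(max_rounds):
        if judge_pass(task, attempt):
            break
        # A binary success/fail signal is sparse, so ask for nuanced,
        # specific textual feedback and retry with it in context.
        feedback = call_llm(f"Task: {task}\nFailed attempt: {attempt}\n"
                            "Give specific feedback for improvement.")
        attempt = generate(task, feedback)
    return attempt
```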
2.1.3 Conducting pairwise comparisons
- With pairwise comparisons, LLM scoring aligns more closely with human scoring
- and is better in terms of positional consistency
- Can extend to listwise comparisons, advanced ranking algorithms, and data filtering
- Options need not be just yes and no: three-option (yes, no, tie) and four-option (yes, no, both good, both bad) settings also work (see the sketch below)
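A sketch of a pairwise judge that runs both answer orders to guard positional consistency; the prompt wording and tie-breaking rule are my assumptions:

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # stand-in for a real chat-completion call

PAIR_PROMPT = ("Question: {q}\nAnswer A: {a}\nAnswer B: {b}\n"
               "Which answer is better? Reply with exactly one of: A, B, tie.")

def parse_verdict(reply: str) -> str:
    r = reply.strip().upper()
    if r.startswith("TIE"):
        return "TIE"
    return r[0] if r and r[0] in "AB" else "TIE"

def compare(q: str, ans1: str, ans2: str) -> str:
    """Judge the pair in both orders; if the verdicts disagree after
    accounting for the swap, treat it as positional bias and call a tie."""
    v1 = parse_verdict(call_llm(PAIR_PROMPT.format(q=q, a=ans1, b=ans2)))
    v2 = parse_verdict(call_llm(PAIR_PROMPT.format(q=q, a=ans2, b=ans1)))
    v2_unswapped = {"A": "B", "B": "A"}.get(v2, v2)
    return v1 if v1 == v2_unswapped else "TIE"  # "A" means ans1 wins
```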
2.1.4 Making multiple-choice selections
- rarer
- can assess deeper understanding or preferences
-
You are given a summary and some semantic content units. For each semantic unit, choose those that can be inferred from the summary and return their numbers. Summary: {Summary} Semantic content units: 1. {SCU_1} 2. {SCU_2} ...... n. {SCU_n}
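A small sketch for parsing the selected SCU numbers out of the judge's reply to this template; the free-form reply format is an assumption:

```python
import re

def parse_selected_units(reply: str, n: int) -> list[int]:
    """Extract the SCU numbers the judge returned, keeping only
    values in the valid range 1..n."""
    return sorted({int(t) for t in re.findall(r"\d+", reply)
                   if 1 <= int(t) <= n})

# e.g., parse_selected_units("Units 1, 3 and 4 can be inferred.", 5) -> [1, 3, 4]
```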
2.2 Model Selection
2.2.1 General LLM
e.g., AlpacaEval: An Automatic Evaluator of Instruction-following Models: uses text-davinci-003 or gpt4_turbo as the baseline against which other models are compared
e.g. 2: GPT-4
2.2.2 Fine-tuned LLM
- PandaLM: data collection: Alpaca + GPT-3.5; model: LLaMA-7B fine-tuned as the evaluator
- JudgeLM: data collection: multiple instruction sets with GPT-4 annotations; model: Vicuna
- Auto-J: data collection: data from multiple scenarios; model: serves as both a generator and an evaluator
- Prometheus: data collection: defines thousands of evaluation criteria and constructs a feedback dataset based on GPT-4; model: a fine-grained evaluator model
Steps:
- Data collection
- data: instructions, the objects to be evaluated, and evaluations (answers can come from GPT-4 or human annotations)
- prompt design
- Model finetuning
- still follows the instruction fine-tuning paradigm (see the sample sketch below)
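A hedged sketch of what one training sample might look like under this paradigm; the field names are illustrative, not the exact schema of PandaLM/JudgeLM/Auto-J/Prometheus:

```python
# Illustrative instruction-tuning sample for a judge model; the target
# "output" would come from GPT-4 or human annotations.
sample = {
    "instruction": ("Compare the two responses to the user query and decide "
                    "which is better. Explain briefly, then give a verdict: "
                    "1, 2, or tie."),
    "input": {
        "query": "Explain what overfitting is.",
        "response_1": "...",  # object to be evaluated
        "response_2": "...",  # object to be evaluated
    },
    "output": ("Response 1 defines overfitting and gives an example, while "
               "Response 2 is vague. Verdict: 1"),
}
```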
2.3 Post-processing Method
Q: "The evaluation format should align with our In-Context Learning design." Is there a deeper meaning here, or does it just say that post-processing must not change the format?
-
Seems to just say the two stages must not conflict.
-
Basic methods:
- extracting specific tokens
- normalizing the output logits
- selecting sentences with high returns
2.3.1 Extracting specific tokens
e.g., Yes/No, Need further eval / Do not need
If the target token is hard to identify reliably: provide explicit instructions or a few-shot strategy (see the sketch below)
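A minimal extraction sketch, assuming the judge was instructed to state a final Yes/No verdict somewhere in its reply:

```python
import re

def extract_verdict(reply: str) -> str | None:
    """Pull the last Yes/No token out of a free-form judge reply;
    returns None when no verdict token is found (ambiguous output)."""
    matches = re.findall(r"\b(yes|no)\b", reply, flags=re.IGNORECASE)
    return matches[-1].capitalize() if matches else None

# e.g., extract_verdict("The answer looks correct. Verdict: Yes") -> "Yes"
```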
2.3.2 Normalizing the output logits
- The output logits are usually normalized (see the sketch below)
E.g., self-consistency and self-reflection scores
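A sketch of normalizing the candidate answer tokens' logits into a probability via softmax; the two logits are assumed to come from a causal LM's distribution at the answer position:

```python
import math

def yes_probability(logit_yes: float, logit_no: float) -> float:
    """Softmax over the 'Yes'/'No' token logits, giving a normalized
    confidence in [0, 1] instead of a raw generated token."""
    m = max(logit_yes, logit_no)  # subtract max for numerical stability
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)

# e.g., yes_probability(3.2, 1.1) ≈ 0.89; such probabilities can then be
# aggregated across samples (self-consistency) or reflection rounds.
```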