Proj. CLJ Paper Reading: A Survey on LLM-as-a-Judge

Abstract

  • Good words: subjectivity, variability, scale
  • Task: Survey of LLM-as-a-Judge, benchmark & evaluation of LLM-as-a-Judge systems
  • Core question: How can reliable LLM-as-a-Judge systems be built?
  • Github: https://github.com/IDEA-FinAI/LLM-as-Evaluator
  • Strategies:
    1. improving consistency
    2. mitigating biases
    3. adapting to diverse assessment scenarios
  • The writing is verbose and hard to follow

1. Intro

Tasks of this paper:

  1. Investigating strategies for making LLM-as-a-Judge systems reliable
  2. Improving consistency
  3. Mitigating biases
  4. Adapting to diverse assessment scenarios
  5. Evaluating the reliability of LLM-as-a-Judge systems themselves

2. Background and Method

  • Good sentences:

    1. LLM-as-a-Judge is an auto-regressive generative model.
    2. In scenarios with sparse reward signals, such as a binary success status (success/fail), the self-reflection model uses the current trajectory and persistent memory to generate nuanced and specific feedback.
    3. LLM and human evaluations are more aligned in the context of pairwise comparisons compared to score-based assessments.
       • From: Aligning with human judgement: The role of pairwise preference in large language model evaluators
    4. Pairwise comparative assessments outperform other judging methods in terms of positional consistency.
       • From: LLM comparative assessment: Zero-shot NLG evaluation through pairwise comparisons using large language models
  • Defs:

    • E ← P_LLM(x ⊕ C) (see the sketch after this list)
    • E: The final evaluation obtained from the whole LLM-as-a-Judge process, in the expected format. It could be a score, a choice, a sentence, etc.
    • P_LLM: The probability function defined by the corresponding LLM; generation is an auto-regressive process.
    • x: The input data to be evaluated, of any supported type (text, image, video).
    • C: The context for the input x, often a prompt template, possibly combined with history information in dialogue.
    • ⊕: The combination operator that merges the input x with the context C; the placement of x can vary (beginning, middle, or end).
  • Basic subtasks: In-Context Learning, Model Selection, Post-processing Method and Evaluation Pipeline
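
A minimal Python sketch of E ← P_LLM(x ⊕ C); `llm_generate`, `combine`, and `judge` are hypothetical names of my own (not from the survey), and later sketches in these notes reuse the same `llm_generate` stub.

```python
def llm_generate(prompt: str) -> str:
    """Placeholder for a real auto-regressive LLM call."""
    raise NotImplementedError

def combine(x: str, context: str, position: str = "end") -> str:
    """The ⊕ operator: place the input x at the beginning, middle, or end of C."""
    if position == "beginning":
        return x + "\n" + context
    if position == "middle":
        head, _, tail = context.partition("{input}")  # assumes an {input} slot in C
        return head + x + tail
    return context + "\n" + x

def judge(x: str, context: str) -> str:
    """E ← P_LLM(x ⊕ C): one generation yields the evaluation E
    (a score, a choice, or a sentence, depending on the prompt)."""
    return llm_generate(combine(x, context))
```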

2.1 In-context learning

  • the design of the prompt
    • generating scores
    • solving true/false questions
    • conducting pairwise comparisons
    • making multiple-choice selections
  • the design of the input
    • type of input (text, image, video, ...)
    • position of the input (beginning, middle, end)
    • manner (individually, in pairs, in batches)

2.1.1 Generating scores

  • scores:
    • discrete or continuous
    • range: 1-5, 0-100...
    • main criteria
      • overall score, helpfulness, relevance, accuracy, level of detail
    • LLM-as-an-Examiner
      • Likert scale scoring functions
      • first have the LLM score each dimension separately, then have it give an overall score
      • predefined dimensions, e.g., accuracy, coherence, factuality and comprehensiveness
      • Evaluate the quality of summaries written for a news article. Rate each summary on four dimensions: {Dimension_1}, {Dimension_2}, {Dimension_3}, and {Dimension_4}. You should rate on a scale from 1 (worst) to 5 (best).
        Article: {Article}
        Summary: {Summary}
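
A minimal sketch of the score-generation prompt above plus score parsing, reusing the hypothetical `llm_generate` stub from Section 2; the "one line per dimension" output format is my assumption.

```python
import re

DIMENSIONS = ["accuracy", "coherence", "factuality", "comprehensiveness"]

def score_summary(article: str, summary: str) -> dict:
    """Likert-scale scoring on predefined dimensions, then parse the reply."""
    prompt = (
        "Evaluate the quality of summaries written for a news article. "
        f"Rate each summary on four dimensions: {', '.join(DIMENSIONS)}. "
        "You should rate on a scale from 1 (worst) to 5 (best), "
        "one line per dimension, e.g. 'accuracy: 4'.\n"
        f"Article: {article}\nSummary: {summary}"
    )
    reply = llm_generate(prompt)  # hypothetical LLM call
    scores = {}
    for dim in DIMENSIONS:
        m = re.search(rf"{dim}\s*[:=]\s*([1-5])", reply, re.IGNORECASE)
        if m:
            scores[dim] = int(m.group(1))
    return scores
```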
        

2.1.2 Solving Yes/No questions

  • typically used in intermediate steps, as part of a feedback loop
    • self-optimization
    • e.g., Reflexion (see the sketch below)
    • Q: Very common in checks for alignment tasks
      • Is this just a plain "is it aligned?" question, or is there something special about it?
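
A rough sketch of how a Yes/No judgment can gate a Reflexion-style self-optimization loop; the prompts and the `self_optimize` name are my assumptions, not Reflexion's actual implementation.

```python
def self_optimize(task: str, max_rounds: int = 3) -> str:
    """Yes/No judging as an intermediate step in a self-optimization loop."""
    answer = llm_generate(f"Solve the following task:\n{task}")
    for _ in range(max_rounds):
        verdict = llm_generate(
            f"Task: {task}\nAnswer: {answer}\n"
            "Does this answer solve the task correctly? Reply Yes or No."
        )
        if verdict.strip().lower().startswith("yes"):
            break  # binary success signal: stop refining
        feedback = llm_generate(
            f"Task: {task}\nFailed answer: {answer}\n"
            "Give nuanced, specific feedback on what to fix."
        )
        answer = llm_generate(
            f"Task: {task}\nFeedback: {feedback}\nWrite a revised answer."
        )
    return answer
```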

2.1.3 Conducting pairwise comparisons

  • with pairwise comparison, machine scores align better with human scores
  • better positional consistency than other judging methods
  • can be extended to listwise comparisons, advanced ranking algorithms, and data filtering
  • options need not be just yes and no; there can be three-option (yes, no, tie) or four-option (yes, no, both good, both bad) formats (see the sketch below)
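
A sketch of pairwise comparison with position swapping, one common way to check positional consistency; `pairwise_judge` and its prompt wording are my assumptions.

```python
def pairwise_judge(question: str, a: str, b: str) -> str:
    """Pairwise comparison with position swapping to check positional consistency."""
    def ask(first: str, second: str) -> str:
        reply = llm_generate(
            f"Question: {question}\nAnswer 1: {first}\nAnswer 2: {second}\n"
            "Which answer is better? Reply with exactly one of: 1, 2, tie."
        )
        return reply.strip().lower()

    v1 = ask(a, b)  # a shown first
    v2 = ask(b, a)  # b shown first
    if v1.startswith("1") and v2.startswith("2"):
        return "A"  # a wins regardless of position
    if v1.startswith("2") and v2.startswith("1"):
        return "B"
    return "tie"  # inconsistent or explicit tie -> no confident winner
```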

2.1.4 Making multiple-choice selections

  • rarer
  • can assess deeper understanding or preferences
  • You are given a summary and some semantic content units. For each semantic unit,
    choose those that can be inferred from the summary and return their numbers.
    Summary: {Summary}
    Semantic content units:
    1. {SCU_1}
    2. {SCU_2}
    ......
    n. {SCU_n}
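
A sketch of the SCU prompt above with a simple parser for the returned numbers; `select_scus` is a hypothetical helper built on the earlier `llm_generate` stub.

```python
import re

def select_scus(summary: str, scus: list) -> list:
    """Multiple-choice selection: which semantic content units (SCUs)
    can be inferred from the summary?"""
    units = "\n".join(f"{i + 1}. {scu}" for i, scu in enumerate(scus))
    reply = llm_generate(
        "You are given a summary and some semantic content units. "
        "For each semantic unit, choose those that can be inferred from the "
        "summary and return their numbers.\n"
        f"Summary: {summary}\nSemantic content units:\n{units}"
    )
    # Keep only numbers that actually index an SCU.
    return sorted({int(n) for n in re.findall(r"\d+", reply)
                   if 1 <= int(n) <= len(scus)})
```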
    

2.2 Model Selection

2.2.1 General LLM

e.g., AlpacaEval: An Automatic Evaluator of Instruction-following Models: uses text-davinci-003 or gpt4_turbo as the baseline against which other models are compared (win-rate sketch below)
e.g. 2: GPT-4
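
A sketch of the AlpacaEval-style idea, computing a win rate against a fixed baseline via the `pairwise_judge` sketch from 2.1.3; AlpacaEval's real protocol (tie handling, annotator configuration) differs in details.

```python
def win_rate(questions: list, model_answers: list, baseline_answers: list) -> float:
    """Fraction of questions where the candidate model beats the baseline;
    ties count as losses in this simplified version."""
    wins = sum(
        pairwise_judge(q, m, b) == "A"
        for q, m, b in zip(questions, model_answers, baseline_answers)
    )
    return wins / len(questions)
```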

2.2.2 Fine-tuned LLM

  1. PandaLM: data collection: Alpaca + GPT-3.5; model: LLaMA-7B fine-tuned as an evaluator
  2. JudgeLM: data collection: multiple instruction sets with GPT-4 annotations; model: Vicuna
  3. Auto-J: data collection: data from multiple scenarios; model: serves as both a generator and an evaluator
  4. Prometheus: data collection: defines thousands of evaluation criteria and constructs a feedback dataset based on GPT-4; model: a fine-grained evaluator model

Steps:

  1. Data collection
     • data: instructions, the objects to be evaluated, and evaluations (which can be GPT-4 or human annotations); a hypothetical record is sketched below
  2. Prompt design
  3. Model fine-tuning
     • still the instruction fine-tuning paradigm
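
The three data ingredients in one hypothetical training record; the real PandaLM / JudgeLM schemas differ in details, this only shows the shape.

```python
# A hypothetical judge fine-tuning record (instruction-tuning paradigm).
record = {
    # the instruction
    "instruction": "Compare the two responses to the query and pick the better one.",
    # the objects to be evaluated
    "query": "Explain photosynthesis to a child.",
    "response_1": "...",
    "response_2": "...",
    # the evaluation (from GPT-4 or human annotators)
    "output": "Response 1 is better: it uses simpler language and a clear analogy.",
}
```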

2.3 Post-processing Method

Q: "The evaluation format should align with our In-Context Learning design." Is there a deeper meaning here, or does it just say that post-processing does not change the format?

  • It seems to just say that the two stages must not conflict

  • Basic methods:

    • extracting specific tokens
    • normalizing the output logits
    • selecting sentences with high returns

2.3.1 Extracting specific tokens

e.g., Yes/No, Need further eval/Do not need

When extraction is hard: provide explicit instructions or a few-shot strategy
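
A minimal sketch of token extraction; a `None` return marks the hard-to-extract case, where re-prompting with explicit instructions or few-shot examples helps.

```python
import re

def extract_verdict(reply: str):
    """Extract a specific verdict token from free-form judge output."""
    m = re.search(r"\b(yes|no)\b", reply, re.IGNORECASE)
    # None signals extraction failure -> re-prompt or add few-shot examples.
    return m.group(1).capitalize() if m else None
```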

2.3.2 Normalizing the output logits

  • the output logits are typically normalized

E.g., self-consistency and self-reflection scores
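
A minimal sketch, assuming the API exposes per-token logits for the candidate tokens "Yes" and "No"; softmax turns them into a probability-like score.

```python
import math

def yes_probability(logits: dict) -> float:
    """Softmax-normalize the logits of the candidate tokens 'Yes' and 'No'."""
    exp = {tok: math.exp(v) for tok, v in logits.items()}
    return exp["Yes"] / sum(exp.values())

# e.g. yes_probability({"Yes": 2.1, "No": -0.3}) ~= 0.92
```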

2.4 Evaluation Pipeline

3. Improvement Strategy

3.1 Design Strategy of Evaluation Prompts

3.2 Improvement Strategy of LLMs' Abilities

3.3 Optimization Strategy of Final Results

4. Evaluation of LLM Evaluators

5. Meta-Evaluation Benchmark

6. Application

7. Challenges

8. Future Work

posted @ 2024-12-21 00:46 雪溯