Proj. CLJ Paper Reading: A Survey on LLM-as-a-Judge
Abstract
- good words: subjectivity, variability, scale
- Task: Survey of LLM-as-a-Judge, benchmark & evaluation of LLM-as-a-Judge systems
- Core question: How can reliable LLM-as-a-Judge systems be built?
- Github: https://github.com/IDEA-FinAI/LLM-as-Evaluator
- Strategies:
- improving consistency
- mitigating biases
- adapting to diverse assessment scenarios
- Some paragraphs are poorly written, others are quite good
1. intro
Tasks of this paper:
- Investigate strategies for making LLM-as-a-Judge reliable
- improving consistency
- mitigating biases
- adapting to diverse assessment scenarios
- Evaluate the reliability of LLM-as-a-Judge systems themselves
2. Background and Method
-
Good sentences:
- LLM-as-a-Judge is an auto-regressive generative model
- In scenarios with sparse reward signals, such as a binary success status (success/fail), the self-reflection model uses the current trajectory and persistent memory to generate nuanced and specific feedback.
- LLM and human evaluations are more aligned in the context of pairwise comparisons compared to score-based assessments.
- Aligning with human judgement: The role of pairwise preference in large language model evaluators
- pairwise comparative assessments outperform other judging methods in terms of positional consistency
- LLM comparative assessment: Zero-shot NLG evaluation through pairwise comparisons using large language models
-
Defs:
- \( E \leftarrow P_{\text{LLM}}(x \oplus C) \)
- E: The final evaluation obtained from the whole LLM-as-a-Judge process in the expected manner. It could be a score, a choice, or a sentence, etc.
- PLLM: The probability function defined by the corresponding LLM, and the generation is an auto-regressive process.
- x: The input data, of any available type (text, image, video), which is waiting to be evaluated.
- C: The context for the input x, often a prompt template, possibly combined with history information in dialogue.
- ⊕: The combination operator that combines the input x with the context C; this operation can vary depending on the context, such as placing x at the beginning, middle, or end (a minimal sketch of the whole formula follows this list).
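A minimal sketch of the formula above, assuming an OpenAI-compatible Python client; the model name, prompt template, and helper name are illustrative and not from the survey:

```python
# Minimal sketch of E <- P_LLM(x (+) C): the context C is a prompt template,
# (+) is string interpolation, and the LLM's sampled generation is read back as E.
from openai import OpenAI

client = OpenAI()

CONTEXT_TEMPLATE = (  # C: evaluation instructions wrapped around the input
    "You are an impartial judge. Rate the following answer for helpfulness "
    "on a scale from 1 (worst) to 5 (best). Reply with the number only.\n\n"
    "Answer to evaluate:\n{x}"
)

def llm_as_a_judge(x: str, model: str = "gpt-4o-mini") -> str:
    """Combine input x with context C and sample E ~ P_LLM(x (+) C)."""
    prompt = CONTEXT_TEMPLATE.format(x=x)          # the (+) combination operator
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,                              # deterministic judging
    )
    return resp.choices[0].message.content.strip()  # E: a score, choice, or sentence
```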
-
Basic subtasks: In-Context Learning, Model Selection, Post-processing Method and Evaluation Pipeline
2.1 In-context learning
- the design of prompt
- generating scores
- solving true/false questions
- conducting pairwise comparisons
- making multiple-choice selections
- the design of input
- type of variables (text, image, video...)
- positions of input (beginning, mid, end)
- manner (individually, in pairs, in batches)
2.1.1 Generating scores
- scores:
- discrete or continuous
- range: 1-5, 0-100...
- main criteria
- overall score, helpfulness, relevance, accuracy, level of details
- LLM-as-an-Examiner
- Likert scale scoring functions
- First have the LLM score each dimension separately, then have it give an overall score
- predefined dimensions, e.g., accuracy, coherence, factuality and comprehensiveness
-
Evaluate the quality of summaries written for a news article. Rate each summary on four dimensions: {Dimension_1}, {Dimension_2}, {Dimension_3}, and {Dimension_4}. You should rate on a scale from 1 (worst) to 5 (best). Article: {Article} Summary: {Summary}
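A hypothetical sketch of filling the Likert template above and parsing per-dimension scores; the dimension names and the "<dimension>: <score>" reply format are assumptions for illustration:

```python
import re

DIMENSIONS = ["Coherence", "Consistency", "Fluency", "Relevance"]  # illustrative

LIKERT_TEMPLATE = (
    "Evaluate the quality of summaries written for a news article. "
    "Rate each summary on four dimensions: {dims}. "
    "You should rate on a scale from 1 (worst) to 5 (best), "
    "one line per dimension as '<dimension>: <score>'.\n"
    "Article: {article}\nSummary: {summary}"
)

def parse_likert_scores(reply: str) -> dict[str, int]:
    """Extract '<dimension>: <score>' pairs; unparsed dimensions are skipped."""
    scores = {}
    for dim in DIMENSIONS:
        m = re.search(rf"{dim}\s*[::]\s*([1-5])", reply, flags=re.IGNORECASE)
        if m:
            scores[dim] = int(m.group(1))
    return scores
```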
2.1.2 Solving Yes/No questions
- Usually used in intermediate steps, e.g., as part of a feedback loop
- self-optimization
- e.g., Reflexion
- Q: Commonly used for checks in alignment tasks
- Is this just an ordinary "aligned or not" question, or is there something special about it?
2.1.3 Conducting pairwise comparisons
- With pairwise comparison, machine scores agree more closely with human scores
- and positional consistency is better
- Can be extended to listwise comparisons, advanced ranking algorithms, and data filtering
- Options are not limited to yes/no; they can also be three-option (yes, no, tie) or four-option (yes, no, both good, both bad)
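A small sketch of a three-option pairwise-comparison prompt; the wording and option letters are illustrative, and the position-swapped twin is reused later for the position-bias checks in Section 3.1.1:

```python
PAIRWISE_TEMPLATE = (
    "You are an impartial judge. Compare the two responses to the question "
    "and decide which is better.\n"
    "Question: {question}\n"
    "Response A: {response_a}\n"
    "Response B: {response_b}\n"
    "Answer with exactly one of: 'A', 'B', or 'Tie'."
)

def build_pairwise_prompts(question: str, r1: str, r2: str) -> tuple[str, str]:
    """Return the prompt and its position-swapped twin."""
    original = PAIRWISE_TEMPLATE.format(question=question, response_a=r1, response_b=r2)
    swapped = PAIRWISE_TEMPLATE.format(question=question, response_a=r2, response_b=r1)
    return original, swapped
```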
2.1.4 Making multiple-choice selections
- rarer
- can assess deeper understanding or preferences
-
You are given a summary and some semantic content units. For each semantic unit, choose those that can be inferred from the summary and return their numbers. Summary: {Summary} Semantic content units: 1. {SCU_1} 2. {SCU_2} ...... n. {SCU_n}
2.2 Model Selection
2.2.1 General LLM
e.g., AlpacaEval: An Automatic Evaluator of Instruction-following Models: uses text-davinci-003 or gpt4_turbo as the baseline and compares other models against it
e.g.2, GPT4
2.2.2 Fine-tuned LLM
- PandaLM: data collection: Alpaca + GPT-3.5; model: LLaMA-7B fine-tuned as the evaluator
- JudgeLM: data collection: multiple instruction sets with GPT-4 annotations; model: Vicuna
- Auto-J: data collection: data from multiple scenarios; model: both a generator and an evaluator
- Prometheus: data collection: defines thousands of evaluation criteria and constructs a feedback dataset based on GPT-4; model: a fine-grained evaluator model
Steps:
- Data collection
- data: instructions, the objects to be evaluated, and evaluations (which can be GPT-4 or human annotations); see the sketch after these steps
- prompt design
- Model finetuning
- still instruction fine-tuning paradigm
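A sketch of how evaluator training samples might be serialized for instruction fine-tuning; the Alpaca-style instruction/input/output field names are an assumption, not the exact schema used by PandaLM, JudgeLM, Auto-J, or Prometheus:

```python
import json

def to_instruction_sample(instruction: str, candidate: str, evaluation: str) -> dict:
    """One evaluator training example: the object to be judged goes into
    'input', and the GPT-4 / human evaluation becomes the target 'output'."""
    return {
        "instruction": "Evaluate the following response to the task: " + instruction,
        "input": candidate,
        "output": evaluation,   # e.g., a score plus a short rationale
    }

def dump_dataset(samples: list[dict], path: str = "judge_sft_data.jsonl") -> None:
    """Write samples in JSONL, a common format for instruction fine-tuning."""
    with open(path, "w", encoding="utf-8") as f:
        for s in samples:
            f.write(json.dumps(s, ensure_ascii=False) + "\n")
```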
2.3 Post-processing Method
Q: "The evaluation format should align with our In-Context Learning design." Is there a deeper meaning to this sentence, or does it just say that post-processing must not change the format?
-
It seems to just mean that the two stages should not conflict
-
Basic methods:
- extracting specific tokens
- normalizing the output logits
- selecting sentences with high returns
2.3.1 Extracting specific tokens
e.g.: Yes/No, Need further eval/Do not need
When the tokens are hard to identify: provide explicit instructions or few-shot strategies
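A minimal sketch of token extraction from the judge's free-form reply; the token sets and fallback behavior are illustrative:

```python
import re

def extract_decision(reply: str) -> str | None:
    """Return 'yes' / 'no' if a decision token is found, else None
    (in which case one might re-prompt with clearer instructions or few-shot examples)."""
    text = reply.strip().lower()
    if re.search(r"\b(yes|need further eval(uation)?)\b", text):
        return "yes"
    if re.search(r"\b(no|do not need)\b", text):
        return "no"
    return None
```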
2.3.2 Normalizing the output logits
- The output logits are usually normalized
- E.g., self-consistency and self-reflection scores
- Mainly exploits the probability distribution at generation time; the draft and rationale are produced by a specialist LM, while a generalist LM recomputes scores from the generation-time probability distribution
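A sketch of normalizing output logits over the candidate decision tokens, assuming access to raw logits (e.g., a locally hosted model); the token ids and the Yes/No choice are illustrative:

```python
import torch

def yes_probability(logits: torch.Tensor, yes_token_id: int, no_token_id: int) -> float:
    """Softmax-normalize over just the 'Yes'/'No' token logits at the answer position."""
    pair = torch.stack([logits[yes_token_id], logits[no_token_id]])
    probs = torch.softmax(pair, dim=0)
    return probs[0].item()   # P("Yes" | prompt), usable as a soft score
```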
2.3.3 Selecting sentences
e.g.: reasoning tree
Q: does this step consistently pick a sentence from a paragraph, summarize a sentence from a paragraph, or do something else?
2.4 Evaluation Pipeline
2.4.1 LLM-as-a-Judge for model
- Use a powerful third-party model to evaluate one's own model
- Challenges:
- Powerful models are often closed-source and costly
- Being closed-source, their underlying parameters and weights may be silently swapped, hurting reproducibility
- SelFee: collects ChatGPT answers (generations) and uses them to fine-tune LLaMA into a critique model
- Shepherd: trains a critique model with data from online communities + human annotation (from scratch?)
- PandaLM: trains a model for pairwise comparison, for LLM Instruction Tuning Optimization
- Zheng et al.: fine-tune Vicuna on a 20K pairwise comparison dataset, aiming to reduce cost
- LMMs(Large Multimodal Models)
- GPT-4v, GPT-4o use pointwise and pairwise evaluation methods
- Prometheus-Vision: allows evaluation with user-designed scoring criteria
- Q: but it remains limited to predefined criteria
- LLaVA-Critic: trained on a variety of datasets; appears usable as a generalist evaluator
- Q: What is the difference between Prometheus-Vision and LLaVA-Critic? What are "predefined criteria" if Prometheus-Vision allows user-designed scoring criteria; does it provide basic criteria that users are allowed to compose? And what is the opposite of a generalist evaluator? Presumably "generalist" means not specially trained for a subdomain and performing similarly across all domains, but what does "generalist" mean for an evaluator?
2.4.2 LLM-as-a-Judge for data
- Annotate data
- Reinforcement learning
- The RLHF framework
- PPO: complex in encoding and hyperparameter tuning, and four models need to be balanced (Policy, Reward, Value, Old Policy)
- Improving datasets for SFT: 1. hindsight-modified prompts 2. principle-driven self-alignment
- Alpaca: uses alpaca prompts to obtain data
- Q: Which model is used?
- Q: Is it only used to obtain alignment scores rather than to directly annotate part of the data? So is it only used to pick out the better-generated part of the data?
- WizardMath: uses an instruction reward model (IRM)
- For each instruction in the original dataset, ChatGPT and Wizard-E generate 2-4 variants.
- This seems intended to diversify the data, so ChatGPT and Wizard-E must be used together rather than only one of them
- Wizard-E scores these expanded data on 1. Definition 2. Precision 3. Integrity
- MLLMs(Multimodal Large Language Models)
- MLLM-as-a-Judge: scoring, pair comparison, batch ranking
Good sentences:
- which often cause hallucinations—outputs inconsistent with visual or contextual evidence
2.4.3 LLM-as-a-Judge for agent
2.5 Quick Practice
- thinking
- what is to be evaluated?
- how do humans evaluate?
- Any reliable evaluation examples?
- Prompt Design
- Scoring Dimension
- Relative comparison is better for improving assessment
- creating effective examples to guide LLMs
- Model selection
- reasoning ability
- instruction-following ability
- specification
- using specific formats?
- numerical scores
- binary responses
3. Improvement strategy
Good Sentences:
- the inherent biases of LLMs like length bias, positional bias and concreteness bias [75] will lead to poor evaluation results
3.1 Design Strategy of Evaluation Prompts(In-Context Learning, C)
- In-Context Learning: learn how to complete a specified task from relevant examples or instructions provided in the prompt, without updating weights or retraining
3.1.1 Optimizing LLMs' Understanding of Evaluation Tasks
- few-shot prompting
- FActScore
- SALAD-Bench
- GPTScore
- Refining the evaluation task instructions
- Decomposition of Evaluation Steps: decompose the steps finely enough and provide sufficient definitions and constraints for each step
- G-Eval, DHP: CoT
- SocREval: Socratic method
- Branch-Solve-Merge: splits the whole evaluation task into multiple parallel subtasks, then merges the results
- Decomposition of Evaluation Criteria: split the overall objective into multiple sub-objectives, then combine them
- E.g., Fluency can be split into grammatical correctness, engagingness, and readability
- HD-Eval: uses hierarchical criteria decomposition to align LLM evaluation
- Hu and Gao et al.: define an explicit hierarchical classification system with 11 criteria, addressing the issue of LLMs potentially confusing different evaluation standards
- Optimize evaluation capabilities to mitigate certain shortcomings of LLMs
- position bias:
- Wang et al.: analyze and verify the impact of position bias, and mitigate it by swapping contents (swapping the order? or something else?) plus a calibration framework
- Auto-J and JudgeLM: shuffling the texts
- PandaLM: annotates as Tie the evaluation results that flip after swapping, i.e., those affected by position bias (see the swap-and-check sketch at the end of this subsection)
- absolute scoring is less robust than relative comparison
- Liu et al.: Pairwise-Preference Search (PAIRS) converts scoring evaluation into ranking evaluation
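A small sketch of the swap-and-check idea from the position-bias bullets above, roughly in the spirit of the swapping and Tie-labeling strategies (the exact procedures in the cited papers differ); the judge callable and its return values are assumptions:

```python
def debiased_pairwise_judgement(judge, question: str, r1: str, r2: str) -> str:
    """judge(question, first, second) -> 'first' | 'second' | 'tie' is assumed."""
    v1 = judge(question, r1, r2)   # r1 shown first
    v2 = judge(question, r2, r1)   # r2 shown first
    # Map both verdicts back to response identities.
    win1 = {"first": "r1", "second": "r2", "tie": "tie"}[v1]
    win2 = {"first": "r2", "second": "r1", "tie": "tie"}[v2]
    return win1 if win1 == win2 else "tie"   # disagreement => position-biased => Tie
```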
3.1.2 Optimizing LLMs' Output Forms
Q: How does this differ from the Optimization Strategy of Final Results (Section 3.3)?
- The output may deviate unexpectedly; for example, when asked to give a concrete score, the model may instead output a textual judgment such as "low relevance"
- Method 1: constrain LLMs' output to structured formats within prompts (due to the inherent generative randomness of LLMs)
- G-Eval, DHP: make the output follow a form-filling paradigm, constraining the format to "dimension/metric: scores/tokens"
- LLM-EVAL: codifies the form-filling paradigm, output the results in JSON dictionary format
- The output may lack explanations
- Q: The meaning of evaluation results from LLM evaluators is difficult to align consistently with instructions and metrics provided in prompts.
- It seems to simply mean that the output 1. may not follow the instructions 2. may not follow the output format of the metrics
- Inconsistencies between scores and explanations can occur.
- CLAIR: asks the LLM to output both the score and the reason, in JSON format
- FLEUR: first uses LLaVA to evaluate image captions and obtain quality scores, then feeds the images, captions, and scores back as input and asks "why? Tell me the reason."
3.2 Improvement Strategy of LLMs' Abilities (PLLM)
3.2.1 Fine-tuning via Meta Evaluation Datasets
- Meta Evaluation: Assessing how well the assessment itself is done.
- Seems no different from an ordinary evaluation dataset? Apparently just a dataset "not used for fine-tuning, only for testing"
- Methods for collecting the dataset:
- A common method involves sampling evaluation questions from publicly available datasets, modifying them with certain templates, and supplementing the dataset with evaluation responses generated either manually or by powerful LLMs like GPT-4
- PandaLM: samples inputs and instructions from Alpaca 52K and uses GPT-3.5 to create the training data
- SALAD-Bench: takes part of its training data from LMSYS-Chat and ToxicChat
- OffsetBias: collects training data from GPT-4, GPT-3.5, and several datasets; base models are LLaMA3-8B-Instruct and FsfairX-LLaMA3-RM-v0.1
- OffsetBias
- uses GPT-4 to generate off-topic prompts (completely unrelated topics)
- uses GPT-3.5 to generate bad responses that follow the off-topic prompts
- fine-tunes the model on the good/bad response pairs to reduce bias
- EvalBiasBench
- analyzes errors in existing meta-evaluations and categorizes the biases behind them
- manually constructs prompts for each bias, then verifies through testing that these prompts indeed trigger erroneous outputs with relatively high probability
- used to verify whether a bias exists
- biases studied: length, concreteness, empty reference, content continuation, nested instruction, familiar knowledge
- JudgeLM: creates training data in different modes, e.g., via reference support or reference drop
- CritiqueLLM: multi-path prompting approach that combines pointwise-to-pairwise and referenced-to-reference-free prompting strategies; splits referenced pointwise grading data into 4 types and creates Eval-Instruct (Q: is this a model, a dataset, or something else?) to fine-tune LLMs, aiming to compensate for the weaknesses of pointwise grading and pairwise comparison (Q: which weaknesses?)
3.2.2 Iterative Optimization Based on Feedback of Evaluation Results
- Q: However, LLM-as-a-judge may still introduce biases during evaluation process in practice, which can impact the overall evaluation quality.
- optimize based on 1. stronger models 2. human evaluators' correction of evaluation results
- INSTRUCTSCORE:
- collect failure modes of the metric outputs
- for each failure mode, query GPT-4 and collect feedback
- select the explanations that best align with human preferences
- re-fine-tune the model
- JADE:
- uses human judges to correct the LLM's evaluations, then uses the most frequently corrected samples for few-shot prompting
3.3 Optimization Strategy of Final Results (Post-Processing, E)
Three kinds are listed, but it seems only two are actually discussed?
- integration of multiple evaluation results
- direct optimization of LLMs' outputs
- conversion of evaluation tasks from pointwise evaluation to pairwise comparison
3.3.1 Integration of Multiple Evaluation Results
- Run the same content multiple times with different hyperparameters and settings, then summarize over these results (a small aggregation sketch follows this list)
- Sottana et al.: average over multiple runs to reduce randomness
- PsychoBench: uses the mean and standard deviation of 10 independent runs
- Auto-J: amplifies the differences between multiple evaluation rounds and combines critiques with/without scenario criteria to obtain the final results
- scenario criteria: specific guidelines for evaluating the quality of LLM responses in a particular scenario
- Run evaluation on the same content with several different evaluators
- CPAD: uses ChatGLM-6B, Ziya-13B, and ChatYuan-Large-v2, integrating their results by voting
- Bai et al.: propose decentralized peer review of LLMs
- Q: None of these methods seems especially memorable?
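A minimal sketch of the two integration patterns above: averaging over repeated runs (roughly in the spirit of Sottana et al. / PsychoBench) and majority-voting over several evaluators or rounds (roughly in the spirit of CPAD); function names and the tie-breaking rule are illustrative:

```python
from collections import Counter
from statistics import mean, stdev

def aggregate_scores(scores: list[float]) -> tuple[float, float]:
    """Mean and standard deviation over independent runs of the same judge."""
    return mean(scores), (stdev(scores) if len(scores) > 1 else 0.0)

def majority_vote(verdicts: list[str]) -> str:
    """Most common verdict across evaluators / rounds; exact ties default to 'tie'."""
    counts = Counter(verdicts).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "tie"
    return counts[0][0]
```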
3.3.2 Direct Optimization of LLMs' Outputs
Good sentences:
- Due to the inherent randomness in LLMs’ generation, the scores may not fully reflect the LLMs’ complete view of the evaluation criteria
- Combine the implicit logits, which capture the LLMs' randomness, with the explicit output scores
- FLEUR: score-smoothing strategy: uses the probabilities of the score tokens 0-9 as weights to smooth the final explicit score
- Q: How exactly is the smoothing done? Isn't there only one explicit output score? (see the sketch at the end of this subsection)
- Requires an interface that exposes token probabilities
- Use self-verification to filter out evaluation results that are not robust enough
- TrueTeacher: when evaluating distilled data, uses self-verification, asking the LLM evaluator about its certainty in the answer
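A sketch answering the smoothing question above: weight each candidate score token 0-9 by its generation probability and take the expectation (roughly the FLEUR-style idea; the exact formula in the FLEUR paper may differ):

```python
def smoothed_score(digit_probs: dict[int, float]) -> float:
    """digit_probs: P(token == str(d)) at the score position, for d in 0..9."""
    total = sum(digit_probs.values())
    return sum(d * p for d, p in digit_probs.items()) / total  # expected score

# e.g. smoothed_score({7: 0.6, 8: 0.3, 9: 0.1}) -> 7.5 instead of a hard "7"
```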
4. Evaluation of LLM Evaluators
4.1 Basic Metric
- Score agreement against human annotation (considering the LLM evaluator as a virtual annotator and evaluating the extent of its agreement with human annotators)
- Agreement, as defined in Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges:
-
\[ \text{Agreement} = \frac{\sum_{i \in \mathcal{D}} \mathbb{I}\left(S_{\text{llm}}^{(i)} = S_{\text{human}}^{(i)}\right)}{|\mathcal{D}|} \]
- where D is the dataset, and S_llm and S_human are the evaluation results of the LLM evaluator and the human judge
-
- Cohen's Kappa, Spearman's correlation
- Treat the LLM-as-a-Judge task as a classification task with human annotations as labels, then use precision, recall, and F1 score as metrics (see the sketch at the end of this subsection)
- meta-evaluation benchmark
- MTBench: 80 human-crafted queries, each with several LLMs’ responses and expert-level human annotation on pairwise comparison
- Chatbot Arena: more than 30K queries from real-world users and their vote on pairs of responses from different LLMs
- FairEval: based on the 80 queries of VicunaBench, with human annotations over responses generated by ChatGPT and Vicuna
- PandaLM: constructs a test set of 999 pairwise samples, 252 of which come from Self-Instruct: Aligning language models with self-generated instructions
- Shepherd: collects 352 samples as a test set for the critique model
- critique model: Generate natural language feedback that pinpoints specific issues, such as factuality, coherence, and alignment with the user's intent.
- Difference from LLM-as-a-Judge: it can be seen as an LLM-as-a-Judge/evaluator that outputs more, providing information on how to fix the response. The sources highlight that Shepherd, as a critique model, can provide "actionable ideas for refinement," often drawing on deep domain knowledge. This means the feedback goes beyond simply pointing out errors; it helps users understand how to fix them.
- Evaluating large language models at evaluating instruction following: creates a meta-evaluation benchmark; Q: focusing on only a part of the LLM-as-a-Judge pipeline
- LLM evaluators are used for automatically annotating large-scale datasets
- Difference: needs to focus more on 1. correctness 2. consistency of annotations 3. alignment with the training objective
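A small sketch of the basic agreement metrics from this subsection, treating the LLM evaluator as a virtual annotator; the use of scikit-learn for Cohen's Kappa is an assumption of this note:

```python
from sklearn.metrics import cohen_kappa_score

def percentage_agreement(llm_labels: list[str], human_labels: list[str]) -> float:
    """The Agreement formula above: fraction of samples where S_llm == S_human."""
    assert len(llm_labels) == len(human_labels)
    hits = sum(a == b for a, b in zip(llm_labels, human_labels))
    return hits / len(llm_labels)

def kappa(llm_labels: list[str], human_labels: list[str]) -> float:
    """Cohen's Kappa: chance-corrected agreement between LLM and human annotators."""
    return cohen_kappa_score(llm_labels, human_labels)
```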
4.2 Bias
Position Bias
- def: LLM evaluators favoring responses in certain positions within the prompt
- e.g., in pairwise comparison, a model prefers Vicuna-13B placed second over ChatGPT placed first
- Large Language Models are not Fair Evaluators: simply placing Vicuna-13B in 2nd place is enough for it to be rated higher than ChatGPT.
- GPT-4 tends to favor the first position, while ChatGPT shows a preference for the second position
- Solutions:
- 2 new metrics:
- Position Consistency: how frequently a judge model selects the same response after their positions are swapped
- Preference Fairness: the percentage of disagreement after swapping the positions of the two candidate responses.
- Large language models are not fair evaluators:
- new metric: Conflict Rate: measures the percentage of disagreement after swapping the positions of the two candidate responses.
Length Bias
- Def: a preference for responses of a certain length, e.g., verbosity bias, a preference for more verbose responses.
- Effects: Even though these expansions do not introduce new information, there is still concern regarding changes to the original response in terms of perplexity, fluency, or style
- Isn't that self-contradictory?
- Verbosity bias in preference labeling by large language models
Self-Enhancement Bias
- LLM evaluators prefer the responses they generated themselves.
- This is only a stopgap, as we may not use the optimal evaluator when evaluating the most advanced LLMs.
Other Bias
- bias against certain demographic groups: genders, race, sexual orientation
- visually appealing content, e.g.: text with emoji
- concreteness bias: a preference for responses with more details, more numbers, more citations, and more complex terminology
- authority bias, citation bias,
- sentiment bias: response with certain emotional tones, such as cheerful, sad, angry, and fearful
Challenges
- Need for Systematic Benchmark:
- e.g., EVALBIASBENCH
- CALM(Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge), unified bias quantification framework, 12 types of bias
- Q: Despite these efforts, there is still no systematic benchmark and dataset that includes all types of biases
- Challenges of Controlled Study
- When studying one bias, it is hard to isolate the influence of other biases or quality-related features
- e.g., for length bias, lengthening a response may also change its style, fluency, or coherence.
- e.g., for self-enhancement bias, GPT-4 prefers its own responses over GPT-3.5's, but that may simply be because GPT-4's responses are of higher quality?
4.3 Adversarial Robustness
- Good sentences:
- Adversarial robustness refers to the ability of a model to withstand deliberate attempts to manipulate the scores through carefully crafted inputs.
- Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment
- constructs a surrogate model from the black-box LLM evaluator, learns adversarial attack phrases, then attacks by inserting these attack phrases without improving text quality
- Cheating automatic llm benchmarks: Null models achieve high win rates.
- a null model that constantly outputs irrelevant responses can still achieve high win rates under LLM-as-a-Judge
- Benchmarking cognitive biases in large language models as evaluators; Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge
- adding suggestive statements such as "90% believe this is better"
- meaningless statement,e.g., "Assistant A loves eating pasta"
- LLM-as-a-judge are still insufficiently robust against interference irrelevant to text quality
- Q: perplexity scores can only detect limited types of adversarial examples
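A sketch of the perplexity check mentioned above, using a small GPT-2 model via HuggingFace transformers; the threshold is arbitrary, and, as the note says, this only catches limited kinds of adversarial text:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss   # mean cross-entropy per token
    return torch.exp(loss).item()

def looks_adversarial(text: str, threshold: float = 200.0) -> bool:
    """Flag candidate responses with unusually high perplexity."""
    return perplexity(text) > threshold
```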
5. Meta-Evaluation Benchmark
Q: there is still a lack of meta-evaluation on whether these improvement strategies effectively optimize the LLM evaluators and which dimensions of evaluation performance are being enhanced. It is possible that some improvement strategies fail to enhance the LLM evaluators' performance or mitigate biases in practical use, leading to computing waste.
5.1 Experiment Settings
5.1.1 Evaluation Dimensions and Benchmarks
- LLMEval: assess the alignment of LLM-as-a-judge with human evaluations.
- 2553 samples. multiple data sources, Each sample consists of a question, a pair of candidate responses, and a human label indicating the preferred response.
- EVALBIASBENCH: to measure six types of biases in LLMs, including length bias, concreteness bias, empty reference bias, content continuation bias, nested instruction bias, and familiar knowledge bias.
- 80 samples, each containing a question, a pair of candidate responses, and a label indicating the correct response without bias influence. In addition to the six types of biases, we also evaluated position bias
5.1.2 Evaluation Metrics
- Percentage Agreement Metric
- Accuracy
- Position Consistency
- Formally, given $N$ samples $\{(q_i, r_{1i}, r_{2i})\}_{i=1}^{N}$, for each sample $(q_i, r_{1i}, r_{2i})$ we query the LLM evaluator with two prompts $P(q_i, r_{1i}, r_{2i})$ and $P(q_i, r_{2i}, r_{1i})$, obtaining two evaluation results $S_i^{r_{12}}$ and $S_i^{r_{21}}$. Each $S_i$ is $r_{1i}$, $r_{2i}$, or "TIE". The position consistency is then
\[ \text{Position Consistency} = \frac{\sum_{i=1}^{N} \mathbb{I}\left(S_i^{r_{12}} = S_i^{r_{21}}\right)}{N} \]
where $\mathbb{I}(\cdot)$ is the indicator function.
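A minimal sketch of the Position Consistency computation above; the verdict encoding ('r1' / 'r2' / 'TIE') is an assumption:

```python
def position_consistency(s_r12: list[str], s_r21: list[str]) -> float:
    """Fraction of samples whose verdict is unchanged after swapping positions."""
    assert len(s_r12) == len(s_r21)
    agree = sum(a == b for a, b in zip(s_r12, s_r21))
    return agree / len(s_r12)
```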
5.1.3 Target LLMs and Strategies
- evaluators: closed-source LLMs GPT-4 and GPT-3.5, and open-source LLMs Qwen2.5-7B, LLaMA3-8B, Mistral-7B, and Mixtral-8×7B
- improvement strategies: Providing Evaluations with Explanations, Self Validation, Summarize by Multiple Rounds, and Vote by Multiple LLMs
- base evaluator: GPT-3.5
5.1.4 Model Configuration
-
gpt-4-turbo-2024-04-09 and gpt-3.5-turbo-0125, Qwen2.5-7B-Instruct, Meta-Llama-3-8B-Instruct, Mistral7B-Instruct-v0.3, Mixtral-8×7B-Instruct-v0.1
-
an Ubuntu machine equipped with a 40GB NVIDIA A100 GPU
-
temperature = 0
-
Summarize by Multiple Rounds: 5 rounds, aggregated by majority voting (majority@5), taking the mean score (mean@5), or taking the best score (best@5).
-
Vote by Multiple LLMs: Setting 1 consists of GPT-4-turbo, GPT-3.5-turbo, and LLaMA3-8B-Instruct, while setting 2 consists of GPT-4-turbo, GPT-3.5-turbo, and Qwen2.5-7B-Instruct.
5.2 Experiment Results and Analysis
5.2.1 Comparison with Different LLMs
- LLaMA3-8B-Instruct and GPT-3.5 perform similarly on most metrics
- apart from Concreteness Bias and Content Continuation Bias, the performance of LLMs except GPT-4-turbo was generally poor, particularly in the Length Bias.
- even GPT-4-turbo experienced substantial performance degradation in Empty Reference Bias and Nested Instruction Bias.
5.2.2 Comparison with Different Strategies
- Providing Evaluations with Explanations may introduce more bias
- Self Validation has the smallest effect, possibly related to overconfidence
- Summarize by Multiple Rounds: majority voting works best; taking the mean or the best score cannot remove the randomness
- Vote by Multiple LLMs:
- In set 1, the poor performance of GPT-3.5-turbo and LLaMA3-8B-Instruct in the Length Bias negatively impacted the overall performance, whereas
- in set 2, the performance in this dimension was better, which was aligned with Qwen2.5-7B-Instruct
5.2.3 Summary
to select more powerful LLMs and to adopt two evaluation strategies: one is swapping the positions of the evaluation contents, the other is taking the majority voting results from multiple rounds of evaluation
6. Application
6.1 Machine Learning
6.1.1 NLP
- sentiment analysis:
- Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge: Q: many biases that affect the judgment?
- translation:
- LLM-as-a-judge & reward model: What they can and cannot do: For translation, studies have shown that the effectiveness of LLM evaluators depends heavily on their English training data, creating limitations in assessing cultural and factual accuracy in non-English contexts
- text summarization:
- A field guide to automatic evaluation of llm-generated summaries: new metrics to better capture semantic qualities and minimize hallucinations
6.2 Other specific domains
6.2.1 Finance
- task: forecasting, anomaly detection, personalized text generation
- Type
- expert knowledge
- a case study on multi-task fine-tuning in finance
- FinCon, a multi-agent system that uses conceptual verbal reinforcement
- benchmark to enhance the understanding of expert knowledge
- UCFE: user-feedback based
- IndoCareer: professional exam questions
- AI-generated domain specific evaluation sets
6.2.2 Law
- Characteristics: more sensitive to bias and factual inaccuracy
- Type
- develop LLM evaluators(developing LLM evaluators specifically for legal applications by addressing professional limitations or designing evaluators themselves)
- a four-dimensional framework for constructing responsible LLMs for legal advice, emphasizing (a) user attributes and behaviors, (b) the nature of queries, (c) AI capabilities, and (d) social impacts
- Eval-RAG, a retrieval-augmented generation (RAG)-based evaluator that assesses the validity of LLM-generated legal texts.
- benchmark
- often tied to specific regions and languages
- ethics [149] and harmfulness [1].
6.2.3 AI for Science
- Medicine: prompt engineering, expert knowledge
- LLaMA2 can assess clinical notes and Q&A responses
- Mathematics: RL, cooperative reasoning methods
- WizardMath
- a Cooperative Reasoning (CoRe) framework that combines generation and verification to mimic human-like dual-process reasoning
- MathVista, a benchmark for evaluating mathematical reasoning in visual contexts
6.2.4 Others
- Software Engineering:
- evaluate bug report summarizations
- automated essay scoring and revising: few-shot learning and prompt tuning, revising
- identify rule violations on platforms
- assessing user preferences based on personas
- evaluating service quality
- analyzing user experience feedback
- assessing creative content like art or literature reviews
7. Challenges
7.1 Reliability
- Overconfidence:
- Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback:they tend to offer overly favorable scores when evaluating their own responses
- Fairness and Generalization:
- Good sentences: Evaluations by LLM-as-a-judge can exhibit considerable inconsistency depending on the context. This is why prompt-based methods are often used to improve LLM-as-a-judge performance
- LLMs struggle to handle long context windows effectively, often showing degraded performance or prioritizing later examples in the sequence
7.2 Robustness
- Attacks often exploit biases, inconsistencies, or loopholes in the model's decision-making process
- Defense: post-processing techniques, such as response filtering and consistency checks
- Challenges:
- self-consistency
- random scoring
- Good Sentences:
- Unlike traditional adversarial attacks on natural language generation (NLG), where the goal is often to mislead the model into generating harmful or incorrect outputs, attacks on LLM-as-a-Judge aim to exploit biases, inconsistencies, or loopholes in the model’s decision-making processes.
7.3 Powerful Backbone Model
Even models like GPT-4 Vision still struggle with complex reasoning across different modalities
8. Future Work
8.1 More Reliable LLM-as-a-Judge
- Good Sentences:
- There is considerable potential for improving reliability in various aspects, including In-Context Learning, model selection, post-processing techniques, and the overall evaluation framework for LLM-as-a-Judge
- the uncertain and evolving nature of robustness risks underscores the necessity of proactive mitigation strategies. These strategies should include the development of adversarial training techniques tailored to judgment tasks, the integration of robust uncertainty quantification methods, and the implementation of human-in-the-loop systems to oversee critical decisions.
8.2 LLM-as-a-Judge for Data Annotation
8.3 MLLM-as-a-Judge
8.4 More LLM-as-a-Judge Benchmarks
Datasets should cover a broader range of scenarios, more complex real-world content, and finer-grained evaluation metrics