Proj CJI Paper Reading: OffsetBias: Leveraging Debiased Data for Tuning Evaluators
Goal: reduce the biases of LLM evaluators (length, concreteness, empty reference, content continuation, nested instruction, familiar knowledge)
Tool:
- OffsetBias: pairwise preference dataset
- EvalBiasBench: meta-evaluation benchmark
Method:
- OffsetBias
- Use GPT-4 to generate an off-topic instruction (a completely unrelated topic) for each original instruction
- Use GPT-3.5 to generate a bad response that follows the off-topic instruction instead of the original one (see the sketch after this list)
- Fine-tune the judge model on the resulting (good response, bad response) pairs to reduce bias
- EvalBiasBench
- Analyze the errors in existing meta-evaluation results and group them into bias categories
- Manually construct prompts for each bias category, then run tests to confirm that these prompts indeed induce wrong judgments at a noticeably higher rate
- Used to verify whether a given bias is present
Note: the off-topic data here is not used as data for preventing injection
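A minimal sketch of the off-topic bad-response generation step, assuming the OpenAI chat-completions API; the prompt wording, model names, and helper functions are illustrative, not the paper's exact templates.

```python
# Sketch: generate an off-topic instruction with GPT-4, then a "bad" response
# that follows the off-topic instruction instead of the original one.
# Assumes OPENAI_API_KEY is set; prompts are illustrative, not the paper's templates.
from openai import OpenAI

client = OpenAI()

def chat(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def make_bad_response(instruction: str) -> tuple[str, str]:
    # Step 1: GPT-4 proposes a completely unrelated (off-topic) instruction.
    off_topic = chat(
        "gpt-4",
        "Write a new instruction whose topic is completely unrelated to the "
        f"following instruction. Output only the new instruction.\n\n{instruction}",
    )
    # Step 2: GPT-3.5 answers the off-topic instruction, producing a fluent but
    # off-topic "bad" response for the original instruction.
    bad_response = chat("gpt-3.5-turbo", off_topic)
    return off_topic, bad_response

if __name__ == "__main__":
    off_topic, bad = make_bad_response("Summarize the plot of Hamlet in three sentences.")
    print(off_topic)
    print(bad)
```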
Abstract
Github: https://github.com/ncsoft/offsetbias
5. Experimental Setup
5.1 Model Description
- Judge models
- Base-data
- Basedata + OFFSETBIAS
- Base-data: 268k human preference dataset
- Ultrafeedback (Cui et al., 2023): single scoring task
- Helpsteer (Wang et al., 2023b): single scoring task
- HH-RLHF-Helpful-Online, HH-RLHF-Harmless-Base (Bai et al., 2022): pairwise comparison
- a subset of PKU-SafeRLHF (Dai et al., 2024): pairwise comparison
- Adding single scoring tasks significantly improves pairwise comparison performance
- Augment the data to counter position bias: swap the order of each response pair, doubling the dataset size (see the sketch below)
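A minimal sketch of the swap augmentation, assuming each example is a dict with hypothetical fields `instruction`, `response_a`, `response_b`, and `preferred`; the paper's actual data schema may differ.

```python
# Sketch: double a pairwise-preference dataset by swapping response order,
# so the judge sees the preferred answer in both positions equally often.
# The dict keys ("instruction", "response_a", "response_b", "preferred") are
# hypothetical field names, not the paper's schema.
def augment_with_swaps(examples):
    augmented = []
    for ex in examples:
        augmented.append(ex)
        augmented.append({
            "instruction": ex["instruction"],
            "response_a": ex["response_b"],   # swap positions
            "response_b": ex["response_a"],
            "preferred": "b" if ex["preferred"] == "a" else "a",  # flip label
        })
    return augmented

data = [{
    "instruction": "Summarize the article.",
    "response_a": "A faithful three-sentence summary...",
    "response_b": "An off-topic answer about something else...",
    "preferred": "a",
}]
print(len(augment_with_swaps(data)))  # 2: original plus swapped copy
```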
- Test OFFSETBIAS for reward model training
- Motivation
- Q: eliminate the influence of prompting and feedback generation on judge model performance, leaving only the impact of the (I, Rg, Rb) triplets in the data (see the loss sketch below)
- Challenge
- Directly training already fine-tuned reward models on new data causes catastrophic forgetting
- Solution:
- WARM: "On the Benefits of Weight Averaged Reward Models"
- Pick FsfairX-LLaMA3-RM-v0.1 as the original model
- Train an intermediate reward model on part of the original model's training data together with OFFSETBIAS
- Merge the intermediate reward model with the original model via SLERP to obtain the final model (see the SLERP sketch below)
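A minimal sketch of training a reward model on (I, Rg, Rb) triplets with a standard Bradley-Terry style pairwise loss; this loss form is the common choice for preference data, and the random scores below stand in for a real model's scalar outputs.

```python
# Sketch: Bradley-Terry style pairwise loss for a reward model scored on
# (instruction I, good response Rg, bad response Rb) triplets.
import torch
import torch.nn.functional as F

def pairwise_reward_loss(score_good: torch.Tensor, score_bad: torch.Tensor) -> torch.Tensor:
    # P(Rg preferred over Rb) = sigmoid(s_g - s_b)  =>  minimize -log sigmoid(s_g - s_b)
    return -F.logsigmoid(score_good - score_bad).mean()

# Toy usage: random scores stand in for model outputs on a batch of 8 triplets.
score_good = torch.randn(8, requires_grad=True)
score_bad = torch.randn(8, requires_grad=True)
loss = pairwise_reward_loss(score_good, score_bad)
loss.backward()
print(float(loss))
```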
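A minimal sketch of SLERP merging between the original and intermediate reward-model checkpoints, interpolating each parameter tensor independently; the interpolation ratio t and the per-tensor treatment are assumptions, and the paper's exact merge recipe may differ.

```python
# Sketch: spherical linear interpolation (SLERP) between the original reward
# model and the intermediate reward model, applied parameter by parameter.
import torch

def slerp(p: torch.Tensor, q: torch.Tensor, t: float, eps: float = 1e-8) -> torch.Tensor:
    p_flat, q_flat = p.flatten().float(), q.flatten().float()
    p_norm = p_flat / (p_flat.norm() + eps)
    q_norm = q_flat / (q_flat.norm() + eps)
    dot = torch.clamp(torch.dot(p_norm, q_norm), -1.0, 1.0)
    theta = torch.acos(dot)
    if theta.abs() < 1e-4:               # nearly parallel: fall back to plain LERP
        return (1 - t) * p + t * q
    sin_theta = torch.sin(theta)
    w_p = torch.sin((1 - t) * theta) / sin_theta
    w_q = torch.sin(t * theta) / sin_theta
    return (w_p * p_flat + w_q * q_flat).reshape(p.shape).to(p.dtype)

def merge_state_dicts(original: dict, intermediate: dict, t: float = 0.5) -> dict:
    # Interpolate every shared parameter between the two checkpoints.
    return {name: slerp(original[name], intermediate[name], t) for name in original}
```

With t = 0.5 both checkpoints contribute equally; the linear-interpolation fallback avoids dividing by a near-zero sin(theta) when the two weight vectors are almost parallel.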
5.2 Benchmarks
- Generative models
- LLMBar: a Natural subset and four Adversarial subsets, named Neighbor, GPTInst, GPTOut and Manual based on their construction methods.
- HHH-Alignment: helpfulness, honesty, harmlessness, and an "other" category
- MT-Bench Human Judge: 80 prompts from MT-Bench; human annotators labeled 3.3k pairwise preferences over responses generated by six models: GPT4-1106-preview, GPT-3.5-turbo-0125, Claude-v1, Vicuna-13B, Alpaca-13B, and LLaMA-13B.
- Reward model
- RewardBench: Chat, Chat Hard, Safety, and Reasoning.
5.3 Baselines
- Generative model baselines:
- OpenAI’s GPT-4o-2024-05-13 and GPT-3.5-turbo-0125 as proprietary baselines, PandaLM (Wang et al., 2024), AutoJ (Li et al., 2024) and Prometheus2 (Kim et al., 2024b) as state-of-the-art evaluator models, and LLaMA-3-8B-Instruct (AI@Meta, 2024) as a baseline model.
- The original prompt template of each model is adopted for a fair comparison
- Additional generative model baselines for EvalBiasBench:
- Phi-3-medium (Microsoft, 2024), Mixtral-8x7B-Instruct (MistralAI, 2024), LLaMA2-Chat-70B (GenAI@Meta, 2023) and LLaMA3-70B-Instruct (AI@Meta, 2024)
- Reward model baselines
- Eurus-RM-7B (Yuan et al., 2024), Starling-RM-34B (Zhu et al., 2023a), RM-Mistral-7B and FsfairX-LLaMA3-RM (Xiong et al., 2024)
6 Experiment Results
- Uses the macro average (a plain mean across categories, i.e., the sum of per-category scores divided by the number of categories, rather than an average weighted by category size); see the sketch below
- Uses the metric: positional agreement rate
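A minimal sketch contrasting the macro average with a size-weighted average, plus an assumed formulation of the positional agreement rate (the share of pairs judged consistently after the two responses are swapped); the category names echo the LLMBar subsets above, and the numbers are purely illustrative.

```python
# Sketch: macro average over benchmark categories (unweighted mean of
# per-category accuracies) vs. a size-weighted average.
def macro_average(per_category_accuracy: dict[str, float]) -> float:
    return sum(per_category_accuracy.values()) / len(per_category_accuracy)

def weighted_average(per_category_accuracy: dict[str, float],
                     per_category_size: dict[str, int]) -> float:
    total = sum(per_category_size.values())
    return sum(per_category_accuracy[c] * per_category_size[c]
               for c in per_category_accuracy) / total

# Assumed formulation of positional agreement: the share of pairs on which the
# judge gives the same verdict after the two responses are swapped.
def positional_agreement(verdicts_original: list[str], verdicts_swapped: list[str]) -> float:
    flipped = {"A": "B", "B": "A"}
    agree = sum(v_swap == flipped[v_orig]
                for v_orig, v_swap in zip(verdicts_original, verdicts_swapped))
    return agree / len(verdicts_original)

acc = {"Natural": 0.90, "Neighbor": 0.60, "GPTInst": 0.70}   # illustrative numbers
size = {"Natural": 100, "Neighbor": 300, "GPTInst": 200}
print(macro_average(acc))           # 0.733... — each category counts equally
print(weighted_average(acc, size))  # 0.683... — larger categories dominate
```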