Proj CJI Paper Reading: OffsetBias: Leveraging Debiased Data for Tuning Evaluators

Goal: reduce the biases of LLM evaluators (length, concreteness, empty reference, content continuation, nested instruction, familiar knowledge)
Tool:

  • OffsetBias: pairwise preference dataset
  • EvalBiasBench: meta-evaluation benchmark

Method:

  1. OffsetBias (pairwise preference data construction)
    1. Use GPT-4 to generate an off-topic instruction (a completely unrelated topic) for each original instruction.
    2. Use GPT-3.5 to generate a bad response that follows the off-topic instruction.
    3. Fine-tune the judge model on the resulting (good response, bad response) pairs to reduce bias (a sketch of this pipeline follows the note below).
  2. EvalBiasBench (meta-evaluation benchmark construction)
    1. Analyze errors in existing meta-evaluation results and group them into bias types.
    2. Manually construct prompts for each bias type, then verify via testing that these prompts indeed trigger evaluation errors at a noticeably higher rate.
    3. Used to verify whether each bias exists in a judge model.

Note: the off-topic data here is not used as data for preventing injection.
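
A minimal sketch of the OffsetBias data-construction steps above (off-topic instruction via GPT-4, bad response via GPT-3.5), assuming the OpenAI Python SDK; the prompt wording, model identifiers, and field names are illustrative placeholders, not the paper's exact templates.

```python
# Sketch of the off-topic bad-response generation described in the Method list.
# Assumptions: OpenAI Python SDK (v1+); prompt wording and model identifiers
# are illustrative, not the paper's exact templates.
from openai import OpenAI

client = OpenAI()

def make_off_topic_instruction(instruction: str) -> str:
    """Ask GPT-4 for an instruction on a completely unrelated topic."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": ("Write one new instruction whose topic is completely "
                        f"unrelated to the following instruction:\n\n{instruction}"),
        }],
    )
    return resp.choices[0].message.content

def make_bad_response(off_topic_instruction: str) -> str:
    """Ask GPT-3.5 to answer the off-topic instruction; the answer serves as
    the bad response Rb for the original instruction."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": off_topic_instruction}],
    )
    return resp.choices[0].message.content

def build_sample(instruction: str, good_response: str) -> dict:
    """Assemble one (I, Rg, Rb) preference triplet for fine-tuning."""
    off_topic = make_off_topic_instruction(instruction)
    return {"instruction": instruction,
            "chosen": good_response,                    # Rg: on-topic good response
            "rejected": make_bad_response(off_topic)}   # Rb: off-topic bad response
```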

Abstract

GitHub: https://github.com/ncsoft/offsetbias

5. Experimental Setup

5.1 Model Description

  • Judge models
    1. Base-data
    2. Base-data + OffsetBias
  • Base-data: 268k human preference dataset
    • Ultrafeedback (Cui et al., 2023): single scoring task
    • Helpsteer (Wang et al., 2023b): single scoring task
    • HH-RLHF-Helpful-Online, HH-RLHF-Harmless-Base (Bai et al., 2022): pairwise comparison
    • a subset of PKU-SafeRLHF (Dai et al., 2024): pairwise comparison
    • Adding single-scoring tasks significantly improves pairwise-comparison performance
    • Augmenting data to address position bias: swap the order of each pair, doubling the size of the dataset (see the position-swap sketch after this list)
  • Test OffsetBias on reward model training
    • Purpose
      • Eliminate the influence of prompting and feedback generation on judge model performance, leaving only the impact of the (I, Rg, Rb) triplets in the data.
    • Challenge
      • Directly training already fine-tuned reward models with new data causes catastrophic forgetting
        • Solution:
          1. Follow WARM ("On the Benefits of Weight Averaged Reward Models")
          2. Select FsfairX-LLaMA3-RM-v0.1 as the original model
          3. Train an intermediate reward model on part of the original model's training data together with OffsetBias
          4. Merge the intermediate reward model and the original model via SLERP to obtain the final model (see the SLERP sketch after this list)
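
A minimal sketch of the position-swap augmentation mentioned under Base-data above: each pairwise sample is duplicated with the two responses in reversed order and the label flipped, doubling the dataset so the judge cannot rely on response position. Field names are hypothetical.

```python
# Position-swap augmentation sketch (field names are hypothetical).
def swap_positions(sample: dict) -> dict:
    """Return a copy of the sample with response positions and label swapped."""
    return {
        "instruction": sample["instruction"],
        "response_a": sample["response_b"],
        "response_b": sample["response_a"],
        "label": "A" if sample["label"] == "B" else "B",
    }

def augment_with_swaps(dataset: list[dict]) -> list[dict]:
    """Original samples plus their position-swapped copies (2x the data)."""
    return dataset + [swap_positions(s) for s in dataset]
```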
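
A minimal sketch of the SLERP merge in step 4 of the solution above, interpolating each parameter tensor between the original reward model and the intermediate reward model; this is a generic per-tensor formulation with an assumed interpolation factor t, not the authors' exact merging code.

```python
# Generic per-tensor SLERP (spherical linear interpolation) merge sketch;
# the interpolation factor t is an assumption, not taken from the paper.
import torch

def slerp(w0: torch.Tensor, w1: torch.Tensor, t: float = 0.5, eps: float = 1e-8) -> torch.Tensor:
    v0, v1 = w0.flatten().float(), w1.flatten().float()
    # Angle between the two flattened weight vectors.
    cos_omega = torch.dot(v0, v1) / (v0.norm() * v1.norm() + eps)
    omega = torch.arccos(cos_omega.clamp(-1.0, 1.0))
    if omega.abs() < eps:
        # Nearly parallel weights: fall back to plain linear interpolation.
        merged = (1 - t) * v0 + t * v1
    else:
        merged = (torch.sin((1 - t) * omega) * v0 + torch.sin(t * omega) * v1) / torch.sin(omega)
    return merged.reshape(w0.shape).to(w0.dtype)

def merge_reward_models(original_state: dict, intermediate_state: dict, t: float = 0.5) -> dict:
    """Merge the original RM and the intermediate RM (trained on a subset of
    the original data plus OffsetBias) into the final reward model."""
    return {name: slerp(original_state[name], intermediate_state[name], t)
            for name in original_state}
```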

5.2 Benchmarks

  • Generative models
    • LLMBar: a Natural subset and four Adversarial subsets, named Neighbor, GPTInst, GPTOut and Manual based on their construction methods.
    • HHH-Alignment: helpfulness, honesty, harmlessness, plus an "other" subset
    • MT-Bench Human Judge: 80 prompts from the MT-Bench. Human annotators labeled 3.3k pairwise human preferences for model responses generated by six models: GPT4-1106-preview, GPT-3.5-turbo-0125, Claude-v1, Vicuna-13B, Alpaca-13B, and LLaMA-13B.
  • Reward model
    • RewardBench: Chat, Chat Hard, Safety, and Reasoning.

5.3 Baselines

  • Generative model baselines:
    • OpenAI’s GPT-4o-2024-05-13 and GPT-3.5-turbo-0125 as proprietary baselines, PandaLM (Wang et al., 2024), AutoJ (Li et al., 2024) and Prometheus2 (Kim et al., 2024b) as state-of-the-art evaluator models, and LLaMA-3-8B-Instruct (AI@Meta, 2024) as a baseline model.
    • We adopt the original prompt templates of the models for fair comparison.
  • EvalBiasBench + Generative model baselines
    • Phi-3-medium (Microsoft, 2024), Mixtral-8x7B-Instruct (MistralAI, 2024), LLaMA2-Chat-70B (GenAI@Meta, 2023) and LLaMA3-70B-Instruct (AI@Meta, 2024)
  • Reward model baselines
    • Eurus-RM-7B (Yuan et al., 2024), Starling-RM-34B (Zhu et al., 2023a), RM-Mistral-7B and FsfairX-LLaMA3-RM (Xiong et al., 2024)

6 Experiment Results

  1. Uses macro average: scores are averaged directly across categories (sum of per-category scores divided by the number of categories) rather than weighted by the number of samples per category; see the sketch after this list.
  2. Uses the metric: positional agreement rate.
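
A small sketch contrasting the macro average in point 1 with a sample-weighted average; the per-category numbers below are hypothetical, not results from the paper.

```python
# Macro average vs. sample-weighted average (hypothetical numbers).
per_category = {            # category -> (num_correct, num_samples)
    "Natural":  (90, 100),  # accuracy 0.90
    "Neighbor": (30, 50),   # accuracy 0.60
    "Manual":   (8, 10),    # accuracy 0.80
}

accuracies = {k: c / n for k, (c, n) in per_category.items()}

# Macro average: each category counts equally regardless of its size.
macro_avg = sum(accuracies.values()) / len(accuracies)          # (0.9 + 0.6 + 0.8) / 3 ≈ 0.767
# Weighted average: each sample counts equally.
weighted_avg = (sum(c for c, _ in per_category.values())
                / sum(n for _, n in per_category.values()))     # 128 / 160 = 0.800

print(f"macro average:    {macro_avg:.3f}")
print(f"weighted average: {weighted_avg:.3f}")
```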




