Summary: GRPO (Group Relative Policy Optimization), https://arxiv.org/pdf/2402.03300. For each question q, GRPO samples from the old policy \(\pi_{old}\) a group of outputs \({o_1, o_2 ...,o
Read more
posted @ 2025-02-17 19:23
February 2025 Archive