Coding Poineer


Archive: February 2025

Summary: Creating a virtual environment with virtualenv: virtualenv myenv --python=/usr/bin/python3.11 GRPO principles: https://huggingface.co/docs/trl/main/en/grpo_trainer (https://mp.weixin.qq.com/s? Read more
posted @ 2025-02-26 09:19 365/24/60 Views (12) Comments (0) Recommended (0)
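
The GRPO link above describes scoring each sampled completion relative to the other completions for the same prompt. As a rough illustration only, the sketch below normalizes a group of rewards by the group mean and standard deviation. The function name, the `eps` value, and the use of the population standard deviation are assumptions made for this example, not details taken from the post or from TRL's code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions for a single prompt.

    Assumption: advantage_i = (r_i - mean) / (std + eps), with the
    population standard deviation; eps only guards against division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions sampled from the same prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Completions that score above the group mean get a positive advantage and those below get a negative one. If every completion in the group receives the same reward, all advantages are zero and the group carries no learning signal.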
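Summary: Reward Hacking: the model exploits design flaws or loopholes in the reward system and adopts unintended behavior to collect high reward, rather than actually achieving the goal the designer intended. Byte (字节) token: https://mp.weixin.qq.com/s/lsCshrnmtO-bYaszLFBSNw DeepSeek training illustrated: https://zhuan Read more
posted @ 2025-02-10 10:45 365/24/60 Views (5) Comments (0) Recommended (0)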
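
To make the reward hacking definition above concrete, here is a toy sketch. The reward function, the scoring function, and the candidate responses are all hypothetical: the proxy reward counts words on the assumption that longer answers are more helpful, while the intended goal is a correct answer.

```python
def proxy_reward(response: str) -> int:
    """Hypothetical flawed reward: more words is assumed to mean more helpful."""
    return len(response.split())


def true_score(response: str, expected: str) -> int:
    """Intended goal: 1 if the expected answer appears as a word, else 0."""
    return 1 if expected in response.split() else 0


# Candidate responses to the question "What is 2 + 2?" (expected answer: "4").
candidates = [
    "4",
    "The answer is 4",
    "well " * 20 + "the answer is 5",  # padded and wrong
]

# Selecting by the proxy reward picks the padded, incorrect response.
best = max(candidates, key=proxy_reward)
print(proxy_reward(best), true_score(best, "4"))
```

The padded response wins under the proxy reward even though it fails the intended goal, which is the gap that reward hacking exploits.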
