MoonOut - 博客园 (cnblogs)

[Pinned] LaTeX · Overleaf | Archive of usage tips

Summary: Scattered tips from experience, saved here for easy reference.

posted @ 2023-06-16 10:10 MoonOut Views (255) Comments (1) Recommendations (0)

[Pinned] Git | git branch operations

Summary: A walkthrough of the git branch family of commands in a simple, realistic scenario.

posted @ 2022-11-23 21:48 MoonOut Views (26) Comments (0) Recommendations (0)

December 24, 2024

Paper skimming notes | 2024.12

Summary: 2024.12 | Notes on papers skimmed

posted @ 2024-12-24 11:50 MoonOut Views (4) Comments (0) Recommendations (0)

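November 30, 2024

offline RL · PbRL | LiRE: building an A>B>C RLT list to obtain more preference data

Summary: The main contributions (the story) of LiRE: 1. build an RLT of the form A>B>C and exploit second-order preference information; 2. use a linear reward model to improve PbRL performance.

posted @ 2024-11-30 16:07 MoonOut Views (104) Comments (0) Recommendations (0)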
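
To make the "more preference data" point concrete, here is a minimal sketch of how a ranked list implies extra pairwise preferences. It only illustrates the counting argument; the function name `pairs_from_ranked_list` is hypothetical, and how LiRE actually builds the list (or handles ties) is not covered here.

```python
from itertools import combinations

def pairs_from_ranked_list(ranked):
    """Return (preferred, less_preferred) pairs implied by a list ranked best-first."""
    # combinations() keeps the input order, so the first element of each pair
    # always ranks higher than the second.
    return list(combinations(ranked, 2))

rlt = ["A", "B", "C"]  # A > B > C
pairs = pairs_from_ranked_list(rlt)
print(pairs)  # [('A', 'B'), ('A', 'C'), ('B', 'C')]
```

A list of n ranked items yields n(n-1)/2 pairs, including ones such as (A, C) that were never compared directly.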

November 26, 2024

Contrastive Learning | Contrastive learning for representation learning in RL

Summary: Applying contrastive learning ideas and the InfoNCE loss to representation learning in RL.

posted @ 2024-11-26 12:24 MoonOut Views (172) Comments (0) Recommendations (0)
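
As a rough illustration of the InfoNCE loss mentioned above, here is a NumPy sketch, not the code from the post. It assumes row i of `anchors` and row i of `positives` form a positive pair, that every other row in the batch acts as a negative, and that cosine similarity with a `temperature` of 0.1 is used; all of these choices are assumptions.

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.1):
    """InfoNCE over a batch: row i of `positives` is the positive for row i of `anchors`."""
    # L2-normalize so that the dot product is a cosine similarity.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                       # (N, N), positives on the diagonal
    logits = logits - logits.max(axis=1, keepdims=True)  # subtract row max for stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
loss_aligned = info_nce_loss(x, x + 0.01 * rng.normal(size=x.shape))
loss_shuffled = info_nce_loss(x, x[::-1])  # positives deliberately mismatched
print(loss_aligned, loss_shuffled)
```

The loss is a cross entropy whose "correct class" is the matching positive, so well-aligned pairs give a much smaller value than mismatched ones.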

November 21, 2024

RL basics | How to reproduce PPO, and some pitfalls along the way

Summary: Notes on my recent process of reproducing PPO...

posted @ 2024-11-21 16:29 MoonOut Views (302) Comments (0) Recommendations (1)

November 20, 2024

PbRL | Christiano's seminal 2017 work, and Preference PPO / PrefPPO

Summary: Reading notes on the paper Deep reinforcement learning from human preferences, and on the PrefPPO algorithm.

posted @ 2024-11-20 15:16 MoonOut Views (82) Comments (0) Recommendations (0)

November 11, 2024

RL basics | How to build a custom RL environment with the OpenAI Gym interface (detailed version)

Summary: You need to implement env.__init__(), obs = env.reset(), and obs, reward, done, info = env.step(action).

posted @ 2024-11-11 22:53 MoonOut Views (161) Comments (0) Recommendations (0)
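
The three methods listed above could look like the following minimal sketch. The toy environment `LineWorldEnv` is made up for illustration, and it is a plain Python class that only mirrors the classic Gym method signatures. A real environment would normally subclass `gym.Env` and declare `observation_space` and `action_space`, which are omitted here to keep the example dependency-free.

```python
class LineWorldEnv:
    """Toy 1-D world: start at cell 0, reach the right-most cell to get reward 1."""

    def __init__(self, size=5, max_steps=20):
        self.size = size
        self.max_steps = max_steps
        self.pos = 0
        self.steps = 0

    def reset(self):
        # Put the agent back at the start and return the first observation.
        self.pos = 0
        self.steps = 0
        return self.pos

    def step(self, action):
        # action: 0 = move left, 1 = move right; the position is clipped to the grid.
        delta = 1 if action == 1 else -1
        self.pos = min(max(self.pos + delta, 0), self.size - 1)
        self.steps += 1
        reached_goal = self.pos == self.size - 1
        reward = 1.0 if reached_goal else 0.0
        done = reached_goal or self.steps >= self.max_steps
        info = {"steps": self.steps}
        return self.pos, reward, done, info

env = LineWorldEnv(size=3)
obs = env.reset()
done = False
while not done:
    obs, reward, done, info = env.step(1)  # always move right
print(obs, reward, done, info)
```

Note that newer Gym and Gymnasium releases return `(obs, info)` from `reset()` and a five-tuple `(obs, reward, terminated, truncated, info)` from `step()`, so the four-tuple above matches only the older interface quoted in the summary.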

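October 15, 2024

Using GPT to draw class diagrams, flowcharts, and other UML diagrams

Summary: Have GPT generate PlantUML code for the UML diagram, then render it online on the PlantUML website.

posted @ 2024-10-15 19:22 MoonOut Views (301) Comments (0) Recommendations (0)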

September 4, 2024

Git | Re-initializing git in a directory

Summary: Delete the git directory: rm -rf .git

posted @ 2024-09-04 16:41 MoonOut Views (46) Comments (0) Recommendations (0)

August 11, 2024

python · pytorch | Limiting a program to 8 threads

Summary: torch.set_num_threads(8)

posted @ 2024-08-11 18:17 MoonOut Views (58) Comments (0) Recommendations (0)

(Solved) OpenGL · MuJoCo · Metaworld | RuntimeError: Failed to initialize OpenGL, assert mdl is not None, AssertionError

Summary: Run unset LD_PRELOAD on the command line.

posted @ 2024-08-11 18:13 MoonOut Views (78) Comments (0) Recommendations (0)

July 31, 2024

Installing MuJoCo 210 on a Linux server

Summary: Official tutorial: https://gist.github.com/saratrajput/60b1310fe9d9df664f9983b38b50d5da

posted @ 2024-07-31 17:57 MoonOut Views (37) Comments (0) Recommendations (0)

Git | How to set up git on a new server

Summary: Run ssh-keygen -t rsa, then copy the contents of .ssh/id_rsa.pub into the New SSH key box.

posted @ 2024-07-31 16:31 MoonOut Views (21) Comments (0) Recommendations (0)

Conda | How to replicate an old server's conda environment (on a new server), Linux servers

Summary: conda env export -n old_env > old_env_conda.yml, then conda env create -n new_env -f old_env_conda.yml

posted @ 2024-07-31 11:40 MoonOut Views (332) Comments (0) Recommendations (0)

Conda | How to install conda on a Linux server

Summary: Search Google for the official tutorial, then switch to the tuna mirror.

posted @ 2024-07-31 11:38 MoonOut Views (1083) Comments (0) Recommendations (0)

How to log in to a Linux server without a password · ssh keys

Summary: Create ~/.ssh/authorized_keys on the remote machine and append the contents of the local .ssh/id_rsa.pub to it.

posted @ 2024-07-31 10:46 MoonOut Views (18) Comments (0) Recommendations (0)
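
The append step described above can be sketched in Python as follows. This is only an illustration that assumes the public key file has already been copied to the remote machine; the helper name `append_public_key` and the key string in the demo are placeholders. In practice the `ssh-copy-id` tool performs the same job in one command.

```python
import tempfile
from pathlib import Path

def append_public_key(pub_key_file, authorized_keys_file):
    """Append a public key to authorized_keys unless it is already listed."""
    pub_key = Path(pub_key_file).read_text().strip()
    auth = Path(authorized_keys_file)
    auth.parent.mkdir(mode=0o700, parents=True, exist_ok=True)
    text = auth.read_text() if auth.exists() else ""
    if pub_key in text.splitlines():
        return False  # key already present, nothing to do
    with auth.open("a") as f:
        if text and not text.endswith("\n"):
            f.write("\n")  # avoid gluing the key onto an unterminated last line
        f.write(pub_key + "\n")
    auth.chmod(0o600)  # sshd usually rejects key files that others can write to
    return True

# Demo in a temporary directory so that no real ~/.ssh is touched.
with tempfile.TemporaryDirectory() as tmp:
    pub = Path(tmp) / "id_rsa.pub"
    pub.write_text("ssh-rsa AAAAexample user@laptop\n")  # placeholder key
    auth_file = Path(tmp) / ".ssh" / "authorized_keys"
    first = append_public_key(pub, auth_file)
    second = append_public_key(pub, auth_file)
    stored = auth_file.read_text()
print(first, second)
```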

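July 25, 2024

PbRL | RIME: using the magnitude of the cross-entropy loss to tell whether a preference is correct + pretraining the reward model with intrinsic rewards

Summary: ① Assuming the CELoss of correct samples is upper-bounded by ρ, one can derive an upper bound on the KL divergence of incorrect samples relative to the distribution P_ψ(x), which is then used to select trusted samples and flip untrusted ones; ② pretrain the reward model with intrinsic rewards normalized to (-1,1).

posted @ 2024-07-25 16:10 MoonOut Views (117) Comments (0) Recommendations (0)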

Cross entropy and KL divergence | Definitions and how they relate

Summary: D_KL(P||Q) = ∫ p(x) log p(x) dx - ∫ p(x) log q(x) dx = -H(P) + H(P,Q)

posted @ 2024-07-25 12:35 MoonOut Views (126) Comments (0) Recommendations (0)
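
A quick numerical check of the relationship, written for discrete distributions: with H(P) = -Σ p log p and H(P,Q) = -Σ p log q, the KL divergence equals H(P,Q) - H(P). The two distributions `p` and `q` below are arbitrary examples.

```python
import math

def entropy(p):
    """H(P) = -sum p log p (natural log)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) = -sum p log q."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D_KL(P || Q) = sum p log(p / q)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
kl = kl_divergence(p, q)
gap = cross_entropy(p, q) - entropy(p)
print(kl, gap)  # the two values agree up to floating-point error
```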

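June 23, 2024

On KL divergence and the ELBO in variational inference

Summary: The ELBO is used to minimize the KL divergence between q(z|s) and p(z|s); this turns into maximizing the log likelihood of p(x|z) plus minimizing the KL divergence between q(z|s) and the prior p(z).

posted @ 2024-06-23 18:10 MoonOut Views (702) Comments (0) Recommendations (0)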
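
For reference, the standard decomposition behind this statement is sketched below. As an assumption made here for consistency, the observed variable is written as x throughout, whereas the summary writes it as s in places.

```latex
\begin{aligned}
\log p(x)
  &= \underbrace{\mathbb{E}_{q(z|x)}\!\left[\log \frac{p(x,z)}{q(z|x)}\right]}_{\mathrm{ELBO}}
   + D_{\mathrm{KL}}\big(q(z|x)\,\|\,p(z|x)\big), \\
\mathrm{ELBO}
  &= \mathbb{E}_{q(z|x)}\big[\log p(x|z)\big]
   - D_{\mathrm{KL}}\big(q(z|x)\,\|\,p(z)\big).
\end{aligned}
```

Since log p(x) does not depend on q, maximizing the ELBO is equivalent to minimizing the KL divergence between q(z|x) and the true posterior p(z|x).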

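Organizing and releasing my course materials from four years of undergraduate study

Summary: Organizing and releasing my course materials from four years of undergraduate study.

posted @ 2024-06-23 16:50 MoonOut Views (88) Comments (1) Recommendations (0)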

python · pandas | (Solved) AttributeError: 'DataFrame' object has no attribute 'append'

Summary: Use df.loc[len(df)] = {'key1': 123, 'key2': 234}

posted @ 2024-06-23 15:39 MoonOut Views (220) Comments (0) Recommendations (0)
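
A short sketch of the workaround, plus one alternative. `DataFrame.append` was removed in pandas 2.0, which is what triggers this error. The starting frame and values below are made up for illustration, and the `df.loc[len(df)]` form assumes the index is the default 0..n-1 range, since otherwise `len(df)` may collide with an existing label.

```python
import pandas as pd

df = pd.DataFrame({"key1": [1], "key2": [2]})

# Workaround from the post: assign a new row at the next integer label.
df.loc[len(df)] = {"key1": 123, "key2": 234}

# Alternative: build a small frame for the new rows and concatenate.
new_rows = pd.DataFrame([{"key1": 5, "key2": 6}])
df = pd.concat([df, new_rows], ignore_index=True)

print(df)
```

When adding many rows, collecting them in a list and calling `pd.concat` once tends to be faster than growing the frame row by row.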

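June 12, 2024

How to sync iPhone photos to a Windows PC (very fast, uses no mobile data)

Summary: On the PC: join the same LAN and set up a shared folder; on the phone: connect to the server in the Files app and save the photos to Files.

posted @ 2024-06-12 11:19 MoonOut Views (213) Comments (0) Recommendations (0)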

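May 28, 2024

MORL | A quick pass through MORL work at the three major conferences

Summary: A brief look at recent multi-objective RL work at the three major conferences.

posted @ 2024-05-28 22:31 MoonOut Views (319) Comments (0) Recommendations (0)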

April 22, 2024

How to kill all wandb-related processes

Summary: ps -ef | grep '[w]andb' to list the processes; pkill -f wandb to kill them.

posted @ 2024-04-22 11:31 MoonOut Views (504) Comments (0) Recommendations (0)

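March 21, 2024

Releasing the new icon for 「月出兮彩云归」

Summary: When the right opportunity came up, I made the personal icon I had had in mind for a long time.

posted @ 2024-03-21 16:55 MoonOut Views (25) Comments (0) Recommendations (0)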

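RL basics | Deriving the policy gradient

Summary: To maximize the expected R(τ), the policy gradient is R(τ) · Σ ∇log π(a|s) averaged over sampled trajectories, i.e. the discounted return × the sum of gradients of log [probability of choosing that action].

posted @ 2024-03-21 16:46 MoonOut Views (203) Comments (0) Recommendations (0)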
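
The key steps of this derivation can be written out with the log-derivative trick. The notation (θ for the policy parameters, p_θ(τ) for the trajectory distribution) is an assumption made here, not taken from the post.

```latex
\begin{aligned}
\nabla_\theta J(\theta)
  &= \nabla_\theta\, \mathbb{E}_{\tau \sim p_\theta}\big[R(\tau)\big]
   = \mathbb{E}_{\tau \sim p_\theta}\big[R(\tau)\, \nabla_\theta \log p_\theta(\tau)\big], \\
\log p_\theta(\tau)
  &= \log p(s_0) + \sum_t \log \pi_\theta(a_t \mid s_t) + \sum_t \log p(s_{t+1} \mid s_t, a_t), \\
\nabla_\theta J(\theta)
  &= \mathbb{E}_{\tau \sim p_\theta}\Big[R(\tau) \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big].
\end{aligned}
```

The initial-state and dynamics terms do not depend on θ, so only the policy log-probabilities survive the gradient.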

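March 9, 2024

offline RL | D4RL: one of the most commonly used offline datasets

Summary: ① medium: a medium-level policy. ② random: a random policy. ③ medium-replay: the entire replay buffer collected while training up to a medium-level policy. ④ medium-expert: an equal mix of expert data and suboptimal data (from a suboptimal or random policy).

posted @ 2024-03-09 17:36 MoonOut Views (920) Comments (0) Recommendations (0)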

March 7, 2024

Contrastive Learning | Kaiming He's SimSiam

Summary: Main tricks: ① apply stop-gradient to B when updating A; ② add a mysterious MLP layer after the encoder.

posted @ 2024-03-07 20:40 MoonOut Views (967) Comments (0) Recommendations (0)

March 6, 2024

offline RL · PbRL | Preference Transformer: transformers just seem powerful anyway

Summary: ① define a non-Markovian reward whose input is a trajectory; ② use a preference of the form exp Σ w(τ) · r(τ).

posted @ 2024-03-06 12:57 MoonOut Views (271) Comments (1) Recommendations (0)
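
The preference form in ② can be sketched as a two-way softmax over weighted reward sums. In this illustration the per-step rewards and importance weights are plain lists supplied by hand, whereas in the paper they are produced by the transformer; the function name `preference_prob` is hypothetical.

```python
import math

def preference_prob(rewards_1, weights_1, rewards_0, weights_0):
    """P[segment 1 is preferred over segment 0] from weighted reward sums."""
    score_1 = sum(w * r for w, r in zip(weights_1, rewards_1))
    score_0 = sum(w * r for w, r in zip(weights_0, rewards_0))
    # Softmax over the two scores; subtracting the max avoids overflow in exp().
    m = max(score_1, score_0)
    e1, e0 = math.exp(score_1 - m), math.exp(score_0 - m)
    return e1 / (e1 + e0)

uniform = [1 / 3] * 3
p_better = preference_prob([1.0, 2.0, 3.0], uniform, [0.0, 0.5, 1.0], uniform)
p_equal = preference_prob([1.0, 1.0, 1.0], uniform, [1.0, 1.0, 1.0], uniform)
print(p_better, p_equal)
```

With uniform weights the score is an average of the rewards; non-uniform weights let some timesteps count more toward the preference.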

February 29, 2024

Posting a representation learning for RL survey of uncertain reliability

Summary: Act as a reinforcement learning expert. Please do a review for representation learning in RL. Should focus on how to map a trajectory to a latent.

posted @ 2024-02-29 16:10 MoonOut Views (88) Comments (1) Recommendations (0)

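February 27, 2024

offline RL · PbRL | OPPO: an offline hindsight transformer for the PbRL setting

Summary: Applies offline HIM to PbRL: ① train a=π(s,z) on offline trajectories; ② train the optimal hindsight z* to move toward z+ and away from z-.

posted @ 2024-02-27 21:38 MoonOut Views (65) Comments (0) Recommendations (0)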

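offline RL | HIM: hindsight-based RL is a big family of ideas

Summary: In offline training trajectories, treat what happens after the current timestep as hindsight, and train the policy to output actions that aim to reach that hindsight.

posted @ 2024-02-27 21:08 MoonOut Views (187) Comments (0) Recommendations (0)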

offline RL | Reading Decision Transformer

Summary: ① sequence: {s, a, R, s, ...}; ② add an MLP on the decoded output at s to predict the action; ③ condition on a given return-to-go as a kind of hindsight.

posted @ 2024-02-27 20:14 MoonOut Views (605) Comments (0) Recommendations (2)
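
The return-to-go in ③ is the sum of rewards from the current step to the end of the trajectory. A minimal sketch of computing it for one trajectory follows; the reward values are made up, and no discounting is applied.

```python
def returns_to_go(rewards):
    """Return R_t = r_t + r_{t+1} + ... + r_T for every timestep t."""
    rtg, running = [], 0.0
    for r in reversed(rewards):  # accumulate from the last step backwards
        running += r
        rtg.append(running)
    return rtg[::-1]

rewards = [1.0, 0.0, 2.0]
rtg = returns_to_go(rewards)
print(rtg)  # [3.0, 2.0, 2.0]
```

At test time the first return-to-go is chosen by the user as a target, and it is decreased by each observed reward as the episode unfolds.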

February 20, 2024

python · matplotlib | Plotting with seaborn and adjusting the legend position

Summary: An archive of plotting code.

posted @ 2024-02-20 11:29 MoonOut Views (405) Comments (0) Recommendations (0)

February 17, 2024

PID control | (Reposted) Tutorial and python code

Summary: Proportional-Integral-Derivative (PID) control.

posted @ 2024-02-17 10:55 MoonOut Views (364) Comments (0) Recommendations (0)
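
As a rough illustration of the three terms, here is a minimal discrete-time PID controller. It is not the code reposted in the post; the gains, the time step, and the toy plant (a pure integrator, x += u · dt) are all assumptions chosen for the demo.

```python
class PID:
    """Discrete PID: u = kp * e + ki * integral(e) + kd * de/dt."""

    def __init__(self, kp, ki, kd, setpoint=0.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.integral = 0.0
        self.prev_error = None

    def update(self, measurement, dt):
        error = self.setpoint - measurement
        self.integral += error * dt
        # No derivative on the first call, since there is no previous error yet.
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

pid = PID(kp=2.0, ki=0.5, kd=0.05, setpoint=1.0)
x, dt = 0.0, 0.1
for _ in range(300):
    u = pid.update(x, dt)
    x += u * dt  # toy plant: the control signal directly sets the rate of change
print(x)
```

The proportional term reacts to the current error, the integral term removes steady-state offset, and the derivative term damps fast changes.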

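February 7, 2024

offline 2 online | Cal-QL: calibrating the Q values learned by conservative offline training so that they match the scale of the true rewards

Summary: ① unlearn: the Q function learned by conservative offline RL is too small and gets overwhelmed by the magnitude of the true online rewards, which destroys the policy initialization and degrades performance. ② calibration: modify the CQL penalty so that Q_θ ≥ Q_β.

posted @ 2024-02-07 20:14 MoonOut Views (82) Comments (0) Recommendations (0)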

offline 2 online | Importance sampling: turning offline + online data into on-policy samples

Summary: The sampling probability over the offline + online buffer should be proportional to d^{on}(s,a) / d^{off}(s,a) (importance sampling).

posted @ 2024-02-07 14:08 MoonOut Views (223) Comments (0) Recommendations (1)
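
A small sketch of what "sampling probability proportional to the density ratio" means in code. The ratio values below are hypothetical placeholders for estimates of d^{on}(s,a) / d^{off}(s,a) on three buffer transitions; how the ratio is estimated is not covered here.

```python
import random

def sampling_probs(density_ratios):
    """Normalize non-negative density ratios into sampling probabilities."""
    total = sum(density_ratios)
    return [r / total for r in density_ratios]

ratios = [0.5, 1.0, 2.5]  # hypothetical d_on / d_off estimates, one per transition
probs = sampling_probs(ratios)

rng = random.Random(0)
batch = rng.choices(range(len(ratios)), weights=probs, k=1000)  # sampled buffer indices
print(probs, batch.count(0), batch.count(2))
```

Transitions that look more on-policy (larger ratio) are drawn more often, so a sampled batch resembles on-policy data more closely than uniform sampling would.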

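February 6, 2024

Convex optimization | Archive of final exam review notes

Summary: Notes published after grades came out...

posted @ 2024-02-06 11:01 MoonOut Views (290) Comments (0) Recommendations (0)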

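Probabilistic graphical models | Archive of notes for two quizzes

Summary: Notes published after grades came out...

posted @ 2024-02-06 10:47 MoonOut Views (43) Comments (0) Recommendations (0)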

Complex systems | 20240116 · Exam questions as recalled

Summary: Notes published after grades came out...

posted @ 2024-02-06 10:37 MoonOut Views (36) Comments (0) Recommendations (0)

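Complex systems | Summary of key points before the exam (incomplete)

Summary: Notes published after grades came out...

posted @ 2024-02-06 10:37 MoonOut Views (36) Comments (0) Recommendations (1)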