Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning

发表时间：2024(ICLR2024)
文章要点： 文章提出用预训练的视觉语言模型作为zero-shot的reward model（VLM-RMs）。好处在于可以通过自然语言来给定一个具体的任务，通过VLM-RMs让强化学习基于reward学习这个任务（using pretrained vision-language models (VLMs) as zero shot reward models (RMs) to specify tasks via natural language）。这样的好处是不用人工设计reward，而且任务自定义扩大了强化的适用范围。
具体的，作者用CLIP作为基础模型，其中包括CLIP image encoder和CLIP language encoder。将图片和任务描述编码成embedding后计算余弦相似度得到reward。

方法基本上就这么简单。
此外作者还设计了一个Goal-Baseline Regularization，不过在mujoco上没效果。这个regularization的出发点是想讲无关信息去掉，指保留和任务相关的信息来计算reward（projecting out irrelevant information about the observation）。具体的，除了任务描述外，还定义了一个baseline描述，比如任务描述是a humanoid robot kneeling，baseline描述是a humanoid robot。然后reward定义为

这个proj的目的是projecting our state embedding onto the line spanned by the baseline and task embeddings。不过作者也说了这个映射并不一定就正确，后面mujoco的实验也表明不用其实效果更好。
还有个细节就是图像的纹理，作者发现图片更真实的话，reward更准确（zero-shot VLM-based rewards work better in environments that are more “photorealistic” because they are closer to the training distribution of the underlying VLM）。
总结：很有意思的工作，任务可以自己定义了，而且是图像输入。效果看起来还不算惊艳，不过方向应用面很广。作者在附录里也说了，这种方式主要还是focus on goal-based tasks，因为reward的计算是基于状态和任务的相似度的，这种设计比较顺理成章（because they are most straightforward to specify using image-text encoder VLMs.）。
不过文章确实方法上novelty有限，实验也做的很少，有两个reject也合理。不过架不住有人抬一手啊，换做我们肯定凉透了。
疑问：文章说alpha取0的时候就是不带regularization的reward，没看出来这两式子一样呢？

posted @ 2024-06-11 11:15 initial_h 阅读(298) 评论(0) 收藏举报

initial_h

https://github.com/initial-h

Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning

公告