Summary: Theoretical foundations. Policy Gradient: $$ R_\theta = \sum_\tau \mathrm{reward}(\tau)\, p_\theta(\tau) \\ \nabla R_\theta = \sum_\tau \mathrm{reward}(\tau)\, \nabla p_\theta(\tau) \\ = \sum_\ …
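The first two lines of the formula can be checked numerically on a toy problem. The sketch below is only an illustration under stated assumptions: it assumes a finite set of three trajectories and a softmax parameterization of $p_\theta(\tau)$, neither of which comes from the post, and the function names are hypothetical. It compares $\sum_\tau \mathrm{reward}(\tau)\,\nabla p_\theta(\tau)$ against a finite-difference gradient of $R_\theta$.

```python
import math


def softmax(theta):
    """Assumed parameterization: p_theta(tau_i) = softmax(theta)_i."""
    m = max(theta)  # subtract the max for numerical stability
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def expected_reward(theta, rewards):
    """R_theta = sum_tau reward(tau) * p_theta(tau)."""
    p = softmax(theta)
    return sum(r * pi for r, pi in zip(rewards, p))


def analytic_gradient(theta, rewards):
    """grad R_theta = sum_tau reward(tau) * grad p_theta(tau).

    For a softmax, d p_i / d theta_j = p_i * (delta_ij - p_j).
    """
    p = softmax(theta)
    n = len(theta)
    return [
        sum(rewards[i] * p[i] * ((1.0 if i == j else 0.0) - p[j]) for i in range(n))
        for j in range(n)
    ]


def numeric_gradient(theta, rewards, eps=1e-6):
    """Central finite differences of R_theta, used only as a cross-check."""
    grad = []
    for j in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[j] += eps
        down[j] -= eps
        diff = expected_reward(up, rewards) - expected_reward(down, rewards)
        grad.append(diff / (2 * eps))
    return grad


if __name__ == "__main__":
    theta = [0.1, -0.3, 0.5]    # hypothetical parameters
    rewards = [1.0, 0.0, 2.0]   # hypothetical per-trajectory rewards
    print(analytic_gradient(theta, rewards))
    print(numeric_gradient(theta, rewards))
```

The two printed gradients should agree closely. In practice the sum over all trajectories is intractable and is usually estimated from sampled trajectories instead; this enumeration is only feasible because the toy trajectory set is tiny.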
posted @ 2020-04-13 22:16 xytpai Views (188) Comments (0) Recommendations (0)