Artificial Intelligence (Reinforcement Learning): Why is Soft Q-Learning not an Actor-Critic method?
Original post:
https://ai.stackexchange.com/questions/39545/why-is-soft-q-learning-not-an-actor-critic-method
I've been reading these two papers from Haarnoja et al.:
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
Reinforcement Learning with Deep Energy-Based Policies
As far as I can tell, Soft Q-Learning (SQL) and SAC appear very similar. Why is SQL not considered an Actor-Critic method, even though it has an action value network (critic?) and policy network (actor?)? I also cannot seem to find a consensus on the exact definition of an Actor-Critic method.
Answer:
Indeed, SQL is very similar to an actor-critic method: it has a soft Q-function critic network with parameters θ and an actor policy network with parameters ϕ. In fact, the paper "Equivalence Between Policy Gradients and Soft Q-Learning" by Schulman et al. proves the equivalence between the gradient of soft Q-learning in the maximum-entropy RL framework and the gradient estimator of (natural) policy gradient with entropy regularization, and further discusses soft Q-learning:
Haarnoja et al. [2017] work in the same setting of soft Q-learning as the current paper, and they are concerned with tasks with high-dimensional action spaces, where we would like to learn stochastic policies that are multi-modal, and we would like to use Q-functions for which there is no closed-form way of sampling from the Boltzmann distribution π(a|s) ∝ π̄(a|s) exp(Q(s,a)/τ). Hence, they use a method called Stein Variational Gradient Descent to derive a procedure that jointly updates the Q-function and a policy π, which approximately samples from the Boltzmann distribution—this resembles variational inference, where one makes use of an approximate posterior distribution.
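As a quick illustration of the sampling problem described in this quote, here is a minimal sketch with made-up Q-values (not code from either paper): for a small discrete action set the Boltzmann policy can be sampled exactly, but for a continuous action space there is no closed form, which is why SQL trains a separate sampling network.

```python
import torch

tau = 1.0
q_values = torch.tensor([1.2, 0.3, -0.5, 2.0])   # made-up Q(s, a) for 4 discrete actions

# Discrete case: an exact sample from pi(a|s) ∝ exp(Q(s,a)/tau).
boltzmann = torch.distributions.Categorical(logits=q_values / tau)
action = boltzmann.sample()

# Continuous case: Q(s, a) is a network over real-valued actions, so
# exp(Q(s,a)/tau) cannot be normalized or sampled exactly; SQL instead
# trains an actor network with SVGD whose samples approximately follow
# the Boltzmann distribution.
```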
Having said that, in Haarnoja et al.'s own words about the subtle difference between SQL and actor-critic methods (quoted below from your first reference), the action sampled from the SVGD-based actor network is not used directly to update the soft Q-function: the critic network updates its parameters from minibatches drawn from the replay memory rather than through the usual advantage function involving the actor. If the sampler is not accurate enough, the method may fail to be stable and to converge to an optimal final (stochastic) policy.
Although the soft Q-learning algorithm proposed by Haarnoja et al. (2017) has a value function and actor network, it is not a true actor-critic algorithm: the Q-function is estimating the optimal Q-function, and the actor does not directly affect the Q-function except through the data distribution. Hence, Haarnoja et al. (2017) motivates the actor network as an approximate sampler, rather than the actor in an actor-critic algorithm. Crucially, the convergence of this method hinges on how well this sampler approximates the true posterior.
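To make the quoted distinction concrete, here is a simplified sketch of how the SQL critic target can be formed: the soft value is a Monte-Carlo estimate over sampled actions, so the actor never enters the critic's loss analytically. This assumes a bounded 1-D action space, a uniform proposal distribution, and a hypothetical `q_net` that takes a concatenated state-action input; it is a simplification of Haarnoja et al. (2017), not their exact procedure.

```python
import math
import torch

alpha = 0.2          # temperature
num_samples = 32     # action samples per state
pdf = 0.5            # density of the Uniform(-1, 1) proposal

def soft_value(q_net, states):
    """Monte-Carlo estimate of V(s) = alpha * log E_{a~q}[exp(Q(s,a)/alpha) / q(a)]."""
    batch = states.shape[0]
    actions = torch.rand(batch, num_samples, 1) * 2.0 - 1.0     # a ~ Uniform(-1, 1)
    s = states.unsqueeze(1).expand(-1, num_samples, -1)
    q = q_net(torch.cat([s, actions], dim=-1)).squeeze(-1)      # shape (batch, num_samples)
    log_w = q / alpha - math.log(pdf)                           # importance weights in log space
    return alpha * (torch.logsumexp(log_w, dim=1) - math.log(num_samples))

# Critic target on a replay minibatch: target = r + gamma * soft_value(q_net, next_states).
# The actor appears nowhere above; it only affects which transitions end up
# in the replay buffer, i.e. the data distribution.
```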
Thank you for the detailed answer. If I'm understanding this correctly, is the main point that the sampled actions used for updates to the critic network have to come directly from the actor (on-policy) and not from a replay pool (off-policy) for it to be a true actor-critic method? –
frances_farmer, Mar 23, 2023 at 11:08
Not necessarily: off-policy DDPG borrows DQN's replay buffer idea and uniformly samples minibatch experiences while still being an actor-critic method (a replay buffer usually only implies off-policy learning, not actor-critic). Here Haarnoja et al. simply require a true actor to directly affect the Q-function, as it does in DDPG through the usual n-step return/advantage target, rather than indirectly through a separate gradient-ascent procedure on some performance metric, as in SQL. They also noted in the same paper that a MAP variant of SQL connects to DDPG, so in my view the distinction is not that clear-cut.
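For reference, a replay buffer by itself is just off-policy data storage with uniform minibatch sampling; a minimal sketch (illustrative, not from either paper) of the structure that DQN, DDPG, and SQL all share:

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores transitions and samples minibatches uniformly at random.
    Using one only makes an algorithm off-policy; it says nothing about
    whether the algorithm is actor-critic."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones
```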
The exchange above is about whether SQL (soft Q-learning) counts as an actor-critic algorithm. The answerer holds that SQL is not an actor-critic algorithm while DDPG is, and the main supporting evidence is:
Although the soft Q-learning algorithm proposed by Haarnoja et al. (2017) has a value function and actor network, it is not a true actor-critic algorithm: the Q-function is estimating the optimal Q-function, and the actor does not directly affect the Q-function except through the data distribution. Hence, Haarnoja et al. (2017) motivates the actor network as an approximate sampler, rather than the actor in an actor-critic algorithm. Crucially, the convergence of this method hinges on how well this sampler approximates the true posterior.
The key sentence is:
the actor does not directly affect the Q-function
Because of this, the answerer concludes that SQL is not an actor-critic algorithm.
At the same time, based on the requirement that "a true actor directly affects the Q-function", the answerer regards DDPG as an actor-critic algorithm: in DDPG the actor's output participates in the critic's computation. Concretely, when computing the actor's loss, the actor's output action is fed into the critic, so the actor and the critic are directly coupled. By contrast, in SQL the actor and the critic are not coupled through their loss functions, which is why the answerer classifies DDPG as actor-critic but not SQL.
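A minimal sketch of the coupling described above (toy networks and shapes assumed, not the exact DDPG implementation): in DDPG the actor's output is passed through the critic, so the actor's loss backpropagates through Q(s, actor(s)); in SQL the actor is instead fitted to the Boltzmann distribution induced by Q and never appears inside the critic's loss.

```python
import torch
import torch.nn as nn

state_dim, action_dim = 8, 2
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))

states = torch.randn(32, state_dim)      # a minibatch of states from the replay buffer

# DDPG-style actor update: the critic directly scores the actor's actions,
# so gradients flow from the critic's Q-value back into the actor.
actor_loss = -critic(torch.cat([states, actor(states)], dim=-1)).mean()
actor_loss.backward()
```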
My own view:
What exactly makes an RL algorithm an actor-critic algorithm has never been fully settled. One common rule is that any algorithm containing an actor and a critic is actor-critic; another view is that an actor-critic algorithm is one that uses the policy gradient formula together with a value estimate; another requires the actor and the critic to be coupled through their loss functions; and yet another requires the critic's value function to be shaped by the actor, i.e. the critic is a direct evaluation of the actor's current policy, as in the canonical actor-critic algorithm.
As you can see, there is no unified standard for deciding whether an algorithm is actor-critic. My personal view is that an algorithm is actor-critic if it uses the policy gradient formula together with a value estimate. By that standard, DDPG, TD3, SAC, and SQL are, in my view, not actor-critic algorithms, because none of them is derived from policy gradient theory, even though they all contain actor and critic networks. In my opinion, the criterion for calling an algorithm actor-critic should be whether it is theoretically rooted in policy gradient theory, not its superficial form or the mutual influence between its networks.
posted on 2024-12-11 10:51 by Angry_Panda