CS294-112 Deep Reinforcement Learning, Fall semester (Berkeley), No. 3: Reinforcement learning introduction
First-order Markov chain: the next state depends only on the current state, not on the earlier history.
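A minimal sketch of what "first order" means in practice, assuming a made-up two-state chain. The state names `s0` and `s1`, the probabilities in `TRANSITIONS`, and the helpers `step` and `rollout` are all hypothetical and not from the lecture. The point is that `step` looks only at the current state when sampling the next one.

```python
import random

# Hypothetical two-state chain; states and probabilities are invented for illustration.
# TRANSITIONS[s][s_next] is the probability of moving from s to s_next.
TRANSITIONS = {
    "s0": {"s0": 0.9, "s1": 0.1},
    "s1": {"s0": 0.5, "s1": 0.5},
}


def step(state, rng):
    """Sample the next state from the current state only (first-order Markov property)."""
    next_states = list(TRANSITIONS[state])
    weights = [TRANSITIONS[state][s] for s in next_states]
    return rng.choices(next_states, weights=weights, k=1)[0]


def rollout(start, length, seed=0):
    """Sample a trajectory of `length` states, beginning at `start`."""
    rng = random.Random(seed)
    states = [start]
    for _ in range(length - 1):
        # Only states[-1] is passed in; the earlier history is never consulted.
        states.append(step(states[-1], rng))
    return states


if __name__ == "__main__":
    print(rollout("s0", 10))
```

In the RL setting the transition would also depend on the chosen action, which this sketch leaves out to keep it a plain chain.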
On-policy algorithms are easier to parallelize.
Off-policy algorithms have to fit both a transition net and a policy net, which makes them much more computationally expensive.