The policy gradient theorem states that the policy should only be updated in states that are visited by the current policy, which leads to insufficient planning in the off-policy states, and thus to convergence to suboptimal policies. We tackle this planning issue by extending the policy gradient theory to policy updates with respect to any state density. Under these generalized policy updates, we show convergence to optimality under a necessary and sufficient condition on the updates' state densities, and thereby solve the aforementioned planning issue. We also prove asymptotic convergence rates that significantly improve on those in the policy gradient literature. To implement the principles prescribed by our theory, we propose an agent, Dr Jekyll & Mr Hyde (JH), with a double personality: Dr Jekyll purely exploits while Mr Hyde purely explores. JH's independent policies allow us to record two separate replay buffers: one on-policy (Dr Jekyll's) and one off-policy (Mr Hyde's), and therefore to update JH's models with a mixture of on-policy and off-policy updates. More than an algorithm, JH defines principles for actor-critic algorithms to satisfy the requirements we identify in our analysis. We extensively test JH on finite MDPs, where it demonstrates a superior ability to recover from convergence to a suboptimal policy without impairing its speed of convergence. We also implement a deep version of the algorithm and test it on a simple problem, where it shows promising results.
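To make the two-buffer mechanism described above concrete, the following minimal Python sketch illustrates how separate on-policy (Dr Jekyll) and off-policy (Mr Hyde) replay buffers could feed a mixed update batch. The class and names (`ReplayBuffer`, `jekyll_buffer`, `hyde_buffer`, `on_policy_ratio`) are hypothetical illustrations, not the paper's implementation.

```python
import random

class ReplayBuffer:
    """Simple FIFO replay buffer (hypothetical helper, not from the paper)."""
    def __init__(self, capacity=10000):
        self.capacity = capacity
        self.storage = []

    def add(self, transition):
        # Drop the oldest transition once capacity is reached.
        if len(self.storage) >= self.capacity:
            self.storage.pop(0)
        self.storage.append(transition)

    def sample(self, batch_size):
        return random.sample(self.storage, min(batch_size, len(self.storage)))

# Dr Jekyll records on-policy (exploitation) transitions,
# Mr Hyde records off-policy (pure exploration) transitions.
jekyll_buffer = ReplayBuffer()
hyde_buffer = ReplayBuffer()

def sample_mixed_batch(batch_size, on_policy_ratio=0.5):
    """Mix on-policy and off-policy samples; the ratio is an assumed hyperparameter."""
    n_on = int(batch_size * on_policy_ratio)
    n_off = batch_size - n_on
    return jekyll_buffer.sample(n_on) + hyde_buffer.sample(n_off)
```

Under these assumptions, the actor-critic models would then be updated on batches returned by `sample_mixed_batch`, realizing the mixture of on-policy and off-policy updates the abstract refers to.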