In Reinforcement Learning, the optimal action at a given state is dependent on policy decisions at subsequent states. As a consequence, the learning targets evolve with time and the policy optimization process must be efficient at unlearning what it previously learnt. In this paper, we discover that the policy gradient theorem prescribes policy updates that are slow to unlearn because of their structural symmetry with respect to the value target. To increase the unlearning speed, we study a novel policy update: the gradient of the cross-entropy loss with respect to the action maximizing $q$, but find that such updates may lead to a decrease in value. Consequently, we introduce a modified policy update devoid of that flaw, and prove its guarantees of convergence to global optimality in $\mathcal{O}(t^{-1})$ under classic assumptions. Further, we assess standard policy updates and our cross-entropy policy updates along six analytical dimensions. Finally, we empirically validate our theoretical findings.
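To make the contrast described above concrete, the following is a minimal sketch (not the paper's implementation) comparing the two update directions on a tabular softmax policy at a single state. The softmax parameterization, the step size, and the toy values of $q$ are illustrative assumptions; the paper's modified update and its convergence analysis are not reproduced here.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def policy_gradient_update(theta, q, lr=0.1):
    """Standard policy-gradient direction: gradient of E_pi[q] w.r.t. the logits.
    For a softmax policy this is pi * (q - pi @ q), so it shrinks with pi(a)."""
    pi = softmax(theta)
    grad = pi * (q - pi @ q)
    return theta + lr * grad

def cross_entropy_update(theta, q, lr=0.1):
    """Cross-entropy direction toward the greedy action argmax_a q(a):
    gradient of -log pi(a*) w.r.t. the logits, i.e. one_hot(a*) - pi."""
    pi = softmax(theta)
    one_hot = np.zeros_like(pi)
    one_hot[np.argmax(q)] = 1.0
    grad = one_hot - pi
    return theta + lr * grad

# Toy illustration: a policy that previously learnt to prefer action 0
# must unlearn that preference once q favours action 1.
theta = np.array([3.0, 0.0])   # strong prior preference for action 0
q = np.array([0.0, 1.0])       # value estimates now favour action 1
for name, update in [("policy gradient", policy_gradient_update),
                     ("cross-entropy", cross_entropy_update)]:
    t = theta.copy()
    for _ in range(50):
        t = update(t, q)
    print(name, softmax(t))
```

In this toy setting the policy-gradient step is damped by the small probability currently assigned to the better action, while the cross-entropy step toward $\arg\max_a q$ is not, which is the unlearning-speed asymmetry the abstract refers to.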