We show that the simplest actor-critic method -- a linear softmax policy updated with TD through interaction with a linear MDP, but featuring no explicit regularization or exploration -- does not merely find an optimal policy, but moreover prefers high entropy optimal policies. To demonstrate the strength of this bias, the algorithm not only has no regularization, no projections, and no exploration like $\epsilon$-greedy, but is moreover trained on a single trajectory with no resets. The key consequence of the high entropy bias is that uniform mixing assumptions on the MDP, which exist in some form in all prior work, can be dropped: the implicit regularization of the high entropy bias is enough to ensure that all chains mix and an optimal policy is reached with high probability. As auxiliary contributions, this work decouples concerns between the actor and critic by writing the actor update as an explicit mirror descent, provides tools to uniformly bound mixing times within KL balls of policy space, and provides a projection-free TD analysis with its own implicit bias which can be run from an unmixed starting distribution.
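To make the setup concrete, the following is a minimal sketch, not the paper's exact algorithm, of an unregularized single-trajectory actor-critic of the kind described above: a linear softmax actor updated by policy-gradient steps using a linear TD critic, with no projections, no entropy bonus, no $\epsilon$-greedy exploration, and no resets. The small MDP, the one-hot feature map, and the constant step sizes are illustrative assumptions, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small MDP with one-hot state-action features (a special
# case of a linear MDP); all names and constants here are assumptions.
n_states, n_actions, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # transition kernel
R = rng.uniform(size=(n_states, n_actions))                       # rewards
d = n_states * n_actions

def phi(s, a):
    # One-hot state-action feature map.
    x = np.zeros(d)
    x[s * n_actions + a] = 1.0
    return x

theta = np.zeros(d)                  # actor parameters (linear softmax policy)
w = np.zeros(d)                      # critic parameters (linear Q estimate)
eta_actor, eta_critic = 0.05, 0.1    # assumed constant step sizes

def policy(s):
    logits = np.array([theta @ phi(s, a) for a in range(n_actions)])
    p = np.exp(logits - logits.max())
    return p / p.sum()

s = 0                                # single trajectory, never reset
a = rng.choice(n_actions, p=policy(s))
for t in range(20000):
    s2 = rng.choice(n_states, p=P[s, a])
    a2 = rng.choice(n_actions, p=policy(s2))       # on-policy next action
    # TD(0) critic update along the single trajectory (SARSA-style target).
    td_err = R[s, a] + gamma * w @ phi(s2, a2) - w @ phi(s, a)
    w += eta_critic * td_err * phi(s, a)
    # Softmax policy-gradient step driven by the critic's value estimate;
    # no regularization or projection is applied to theta.
    pi = policy(s)
    grad_log = phi(s, a) - sum(pi[b] * phi(s, b) for b in range(n_actions))
    theta += eta_actor * (w @ phi(s, a)) * grad_log
    s, a = s2, a2

print("greedy action per state:", [int(np.argmax(policy(si))) for si in range(n_states)])
```

The point of the sketch is what it omits: every update uses only the data generated by the current policy on one unbroken trajectory, so any mixing the analysis relies on must come from the policy's own implicit bias toward high entropy rather than from an external exploration or restart mechanism.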