Many policy gradient methods are variants of Actor-Critic (AC), where a value function (critic) is learned to facilitate updating the parameterized policy (actor). The update to the actor involves a log-likelihood update weighted by the action-values, with the addition of entropy regularization for soft variants. In this work, we explore an alternative update for the actor, based on an extension of the cross entropy method (CEM) to condition on inputs (states). The idea is to start with a broad policy and slowly concentrate around maximal actions, using a maximum likelihood update towards actions in the top percentile per state. The speed of this concentration is controlled by a proposal policy, which concentrates at a slower rate than the actor. We first provide a policy improvement result in an idealized setting, and then prove that our conditional CEM (CCEM) strategy tracks a CEM update per state, even with changing action-values. We empirically show that our Greedy AC algorithm, which uses CCEM for the actor update, performs better than Soft Actor-Critic and is much less sensitive to entropy regularization.
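To make the conditional CEM actor update described above concrete, the following minimal sketch samples candidate actions from a proposal policy for a given state, keeps the top-percentile actions under the current action-values, and takes a maximum-likelihood step toward those elite actions, with the proposal updated at a smaller step size so it concentrates more slowly than the actor. The linear-Gaussian parameterization, the toy q_value critic, and all hyperparameters here are illustrative assumptions for exposition, not the paper's implementation.

```python
import numpy as np

# Illustrative toy setup: 1-D actions, linear-Gaussian actor and proposal.
rng = np.random.default_rng(0)
state_dim, n_samples, elite_fraction = 4, 30, 0.2
actor_lr, proposal_lr = 0.1, 0.02   # proposal concentrates more slowly

W_actor = np.zeros(state_dim)       # actor mean parameters
W_prop = np.zeros(state_dim)        # proposal mean parameters
sigma = 1.0                         # fixed exploration scale (assumption)

def q_value(state, action):
    """Stand-in critic: peaks when the action matches a fixed target."""
    target = state @ np.array([0.5, -0.3, 0.8, 0.1])
    return -(action - target) ** 2

def ccem_actor_update(state):
    global W_actor, W_prop
    # 1. Sample candidate actions from the (broader) proposal policy.
    mean_prop = state @ W_prop
    actions = rng.normal(mean_prop, sigma, size=n_samples)
    # 2. Rank by action-value and keep the top percentile (elite set).
    values = np.array([q_value(state, a) for a in actions])
    n_elite = max(1, int(elite_fraction * n_samples))
    elites = actions[np.argsort(values)[-n_elite:]]
    # 3. Maximum-likelihood step toward the elite actions for the actor,
    #    and a smaller step for the proposal so it concentrates slower.
    grad_actor = (elites.mean() - state @ W_actor) * state / sigma ** 2
    W_actor += actor_lr * grad_actor
    grad_prop = (elites.mean() - state @ W_prop) * state / sigma ** 2
    W_prop += proposal_lr * grad_prop

for _ in range(200):
    ccem_actor_update(rng.normal(size=state_dim))
```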