Many policy gradient methods are variants of Actor-Critic (AC), where a value function (critic) is learned to facilitate updating the parameterized policy (actor). The update to the actor involves a log-likelihood update weighted by the action-values, with the addition of entropy regularization for soft variants. In this work, we explore an alternative update for the actor, based on an extension of the cross entropy method (CEM) to condition on inputs (states). The idea is to start with a broad policy and slowly concentrate around maximal actions, using a maximum likelihood update towards actions in the top percentile per state. The speed of this concentration is controlled by a proposal policy, which concentrates at a slower rate than the actor. We first provide a policy improvement result in an idealized setting, and then prove that our conditional CEM (CCEM) strategy tracks a CEM update per state, even with changing action-values. We empirically show that our Greedy AC algorithm, which uses CCEM for the actor update, performs better than Soft AC and is much less sensitive to entropy regularization.
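The sketch below is a minimal, illustrative take on the per-state concentration idea described above, not the paper's algorithm: it assumes a single state, a 1-D Gaussian actor and proposal policy, and a toy critic `q`, and it replaces the log-likelihood gradient step on a parameterized policy with a simple interpolation toward the maximum-likelihood Gaussian fit of the elite actions. All names (`q`, `ccem_update`, `rho`, the learning rates) are hypothetical.

```python
import numpy as np

def q(state, actions):
    # Toy critic: prefers actions near a state-dependent target sin(state).
    return -(actions - np.sin(state)) ** 2

def ccem_update(state, actor, proposal, n_samples=64, rho=0.2,
                actor_lr=0.1, proposal_lr=0.02, rng=np.random.default_rng(0)):
    """One CCEM-style step for a single state: sample actions from the proposal,
    keep the top-rho fraction by action-value, and move both policies toward the
    maximum-likelihood Gaussian fit of those elite actions. The actor takes a
    larger step, so the proposal concentrates at a slower rate."""
    mu_p, sigma_p = proposal
    actions = rng.normal(mu_p, sigma_p, size=n_samples)
    values = q(state, actions)
    elite = actions[np.argsort(values)[-int(rho * n_samples):]]

    # Maximum-likelihood Gaussian fit to the elite (top-percentile) actions.
    mu_star, sigma_star = elite.mean(), max(elite.std(), 1e-3)

    actor = (actor[0] + actor_lr * (mu_star - actor[0]),
             actor[1] + actor_lr * (sigma_star - actor[1]))
    proposal = (mu_p + proposal_lr * (mu_star - mu_p),
                sigma_p + proposal_lr * (sigma_star - sigma_p))
    return actor, proposal

# Start broad and concentrate: the actor mean approaches sin(1.0) ~ 0.84
# while its standard deviation shrinks; the proposal lags behind.
actor, proposal = (0.0, 1.0), (0.0, 1.0)
for _ in range(200):
    actor, proposal = ccem_update(state=1.0, actor=actor, proposal=proposal)
print(actor, proposal)
```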