PPO (Proximal Policy Optimization) is a state-of-the-art policy gradient algorithm that has been successfully applied to complex computer games such as Dota 2 and Honor of Kings. In these environments, an agent takes compound actions consisting of multiple sub-actions. PPO uses clipping to restrict policy updates. Although clipping is simple and effective, it is not sample-efficient. For compound actions, most PPO implementations consider the joint probability (density) of the sub-actions, which means that if the ratio of a sample (a state, compound-action pair) falls outside the clip range, the gradient that sample produces is zero. Instead, we calculate the loss for each sub-action separately, which is less prone to clipping during updates and thereby makes better use of samples. Further, we propose a multi-action mixed loss that combines the joint and separate probabilities. We perform experiments in Gym-$\mu$RTS and MuJoCo. Our hybrid model improves performance by more than 50\% in different MuJoCo environments compared to OpenAI's PPO benchmark results, and in Gym-$\mu$RTS we find that the sub-action loss outperforms the standard PPO approach, especially when the clip range is large. Our findings suggest that this method better balances the use-efficiency and quality of samples.
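The contrast between the joint-ratio loss and the per-sub-action loss can be illustrated with a short sketch. The snippet below is a minimal illustration assuming a PyTorch setup with factorized (independent) sub-action distributions; the function names, tensor shapes, and the mixing weight `beta` are illustrative assumptions for exposition, not the paper's exact implementation.

```python
import torch

def ppo_loss_joint(new_logps, old_logps, adv, clip_eps=0.2):
    """Standard PPO: one clipped ratio built from the joint log-probability
    (sum of sub-action log-probs). If this single ratio leaves the clip
    range, the whole sample contributes zero gradient.
    new_logps, old_logps: (batch, n_subactions); adv: (batch,)."""
    ratio = torch.exp(new_logps.sum(dim=-1) - old_logps.sum(dim=-1))
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    return -torch.min(unclipped, clipped).mean()

def ppo_loss_per_subaction(new_logps, old_logps, adv, clip_eps=0.2):
    """Sub-action variant: clip each sub-action's ratio separately, so a
    sample still passes gradients through the sub-actions whose ratios
    remain inside the clip range."""
    ratio = torch.exp(new_logps - old_logps)                   # (batch, n_subactions)
    unclipped = ratio * adv.unsqueeze(-1)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv.unsqueeze(-1)
    return -torch.min(unclipped, clipped).sum(dim=-1).mean()

def ppo_loss_mixed(new_logps, old_logps, adv, clip_eps=0.2, beta=0.5):
    """Mixed loss: a weighted combination of the joint and per-sub-action
    terms (the weighting scheme here is an assumption)."""
    return (beta * ppo_loss_joint(new_logps, old_logps, adv, clip_eps)
            + (1 - beta) * ppo_loss_per_subaction(new_logps, old_logps, adv, clip_eps))
```

In this sketch the only difference between the two losses is whether the importance ratio is formed before or after summing the sub-action log-probabilities; the per-sub-action form is what allows partially clipped samples to keep contributing gradient.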