We say an algorithm is batch size-invariant if changes to the batch size can largely be compensated for by changes to other hyperparameters. Stochastic gradient descent is well-known to have this property at small batch sizes, via the learning rate. However, some policy optimization algorithms (such as PPO) do not have this property, because of how they control the size of policy updates. In this work we show how to make these algorithms batch size-invariant. Our key insight is to decouple the proximal policy (used for controlling policy updates) from the behavior policy (used for off-policy corrections). Our experiments help explain why these algorithms work, and additionally show how they can make more efficient use of stale data.
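To make the decoupling concrete, the sketch below shows one plausible form of a clipped surrogate loss in which the proximal policy (used for clipping) is separated from the behavior policy (used for the importance weight). This is a minimal illustration under assumed names and signatures (`decoupled_ppo_loss`, log-probability inputs, a PyTorch implementation), not the paper's reference implementation.

```python
import torch

def decoupled_ppo_loss(logp_theta, logp_prox, logp_behav, advantages, clip_eps=0.2):
    """Sketch of a PPO-style surrogate loss with the proximal policy decoupled
    from the behavior policy (hypothetical form, for illustration only).

    Args:
        logp_theta: log-probs of taken actions under the current policy.
        logp_prox:  log-probs under the proximal policy (e.g. a slowly moving
                    copy of recent policies), controlling the update size.
        logp_behav: log-probs under the behavior policy that collected the data,
                    providing the off-policy correction.
        advantages: advantage estimates for the taken actions.
    """
    # The ratio against the proximal policy is what gets clipped.
    ratio_prox = torch.exp(logp_theta - logp_prox)
    clipped = torch.clamp(ratio_prox, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = torch.min(ratio_prox * advantages, clipped * advantages)

    # A separate importance weight corrects for data collected off-policy
    # by the behavior policy; it is treated as a constant for the gradient.
    importance_weight = torch.exp(logp_prox - logp_behav).detach()
    return -(importance_weight * surrogate).mean()
```

In this reading, standard PPO corresponds to the special case where the proximal and behavior policies coincide; decoupling them is what allows the size of policy updates to be controlled independently of how stale the data is.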