Several low-bandwidth, distributable black-box optimization algorithms in the finite-difference family, such as Evolution Strategies, have recently been shown to perform nearly as well as tailored Reinforcement Learning methods in some Reinforcement Learning domains. One shortcoming of these black-box methods is that they must collect information about the structure of the return function at every update, and can often employ only information drawn from a distribution centered on the current parameters. As a result, when these algorithms are distributed across many machines, a significant portion of total runtime may be spent with many machines idle, waiting for a final return and then for an update to be calculated. In this work we introduce a novel method for incorporating older data into finite-difference algorithms, yielding a scalable algorithm that avoids significant idle time and wasted computation.
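The bottleneck described above stems from how vanilla Evolution Strategies estimates a gradient: every worker must evaluate returns at perturbations drawn around the *current* parameters, so all returns become stale the moment an update is applied. Below is a minimal sketch of one such finite-difference update with antithetic sampling, assuming a caller-supplied scalar return function `episode_return` (a hypothetical name, not from the paper); this illustrates the baseline the work builds on, not the paper's stale-data method.

```python
import numpy as np

def es_update(theta, episode_return, n_pairs=16, sigma=0.1, lr=0.01, rng=None):
    """One vanilla Evolution Strategies step (antithetic sampling).

    episode_return: hypothetical caller-supplied function mapping a
    parameter vector to a scalar return.
    """
    rng = rng or np.random.default_rng()
    grad = np.zeros_like(theta)
    for _ in range(n_pairs):
        eps = rng.standard_normal(theta.shape)
        # Finite-difference estimate along a random direction, centered
        # on the current theta -- the property that forces every worker
        # to resample after each update, causing the idle time the
        # abstract describes.
        grad += (episode_return(theta + sigma * eps)
                 - episode_return(theta - sigma * eps)) / (2 * sigma) * eps
    return theta + lr * grad / n_pairs
```

In a distributed setting, each of the `2 * n_pairs` return evaluations can run on a separate machine, but the next step cannot begin until the slowest one finishes, which is the synchronization cost the proposed method targets.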