We propose a novel reinforcement learning algorithm, AlphaNPI, that combines the strengths of Neural Programmer-Interpreters (NPI) and AlphaZero. NPI contributes structural biases in the form of modularity, hierarchy, and recursion, which help reduce sample complexity, improve generalization, and increase interpretability. AlphaZero contributes powerful neural-network-guided search algorithms, which we augment with recursion. AlphaNPI assumes only a hierarchical program specification with sparse rewards: 1 when the program execution satisfies the specification, and 0 otherwise. Using this specification, AlphaNPI is able to train NPI models effectively with RL for the first time, completely eliminating the need for strong supervision in the form of execution traces. Experiments show that AlphaNPI can sort as well as previous strongly supervised NPI variants. The AlphaNPI agent is also trained on a Tower of Hanoi puzzle with two disks and is shown to generalize to puzzles with an arbitrary number of disks.
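To make the sparse reward specification concrete, here is a minimal sketch (not taken from the paper) of what such a reward signal could look like for a sorting program in the hierarchy. The program name `BUBBLESORT` and the `is_sorted` specification check are illustrative assumptions, not AlphaNPI's actual interface.

```python
# Minimal sketch of a sparse reward under a hierarchical program specification:
# the agent receives 1 only when execution satisfies the program's spec, else 0.
# Program names and the specification check below are hypothetical.

from typing import List


def is_sorted(tape: List[int]) -> bool:
    """Illustrative specification check for a hypothetical BUBBLESORT program."""
    return all(tape[i] <= tape[i + 1] for i in range(len(tape) - 1))


def sparse_reward(program_name: str, tape: List[int]) -> float:
    """Return 1.0 iff the executed program satisfies its specification."""
    if program_name == "BUBBLESORT":
        return 1.0 if is_sorted(tape) else 0.0
    # Other programs in the hierarchy would plug in their own checks here.
    return 0.0


# Example: evaluating the final environment state after execution.
print(sparse_reward("BUBBLESORT", [1, 2, 3, 5]))  # -> 1.0 (spec satisfied)
print(sparse_reward("BUBBLESORT", [3, 1, 2]))     # -> 0.0 (spec violated)
```

No intermediate shaping rewards are given; this all-or-nothing signal is what makes the neural-network-guided search component essential for discovering successful executions.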