Recent meta-reinforcement learning work has emphasized the importance of mnemonic control for agents to quickly assimilate relevant experience in new contexts and suitably adapt their policy. However, which computational mechanisms support flexible behavioral adaptation from past experience remains an open question. Inspired by neuroscience, we propose MetODS (for Meta-Optimized Dynamical Synapses), a broadly applicable model of meta-reinforcement learning that leverages fast synaptic dynamics influenced by action-reward feedback. We develop a theoretical interpretation of MetODS as a model that learns powerful control rules in policy space, and demonstrate empirically that robust reinforcement learning programs emerge spontaneously from these rules. We further propose a formalism which efficiently optimizes the meta-parameters governing the MetODS synaptic processes. In multiple experiments and domains, MetODS outperforms or compares favorably with previous meta-reinforcement learning approaches. Our agents can perform one-shot learning, approach optimal exploration/exploitation strategies, generalize navigation principles to unseen environments, and demonstrate a strong ability to learn adaptive motor policies.
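To make the central idea concrete, the following is a minimal illustrative sketch, not the authors' implementation, of a policy whose weights carry a fast, reward-modulated Hebbian trace on top of meta-trained slow weights. All names and hyperparameters (FastWeightPolicy, alpha, eta) are hypothetical, and the update rule is a generic reward-gated Hebbian rule standing in for the meta-optimized synaptic dynamics described above.

```python
# Illustrative sketch only: a linear policy with meta-trained slow weights and
# fast weights updated online from action-reward feedback (hypothetical names).
import numpy as np

class FastWeightPolicy:
    def __init__(self, obs_dim, act_dim, alpha=0.5, eta=0.1, seed=0):
        rng = np.random.default_rng(seed)
        # Slow weights: fixed within an episode (would be meta-optimized across tasks).
        self.W_slow = rng.normal(scale=0.1, size=(act_dim, obs_dim))
        # Fast weights: synaptic state reset per task, updated at every step.
        self.W_fast = np.zeros((act_dim, obs_dim))
        # alpha, eta play the role of meta-parameters governing plasticity.
        self.alpha, self.eta = alpha, eta

    def act(self, obs):
        logits = (self.W_slow + self.alpha * self.W_fast) @ obs
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        action = np.random.choice(len(probs), p=probs)
        return action, probs

    def update(self, obs, action, reward, probs):
        # Reward-gated Hebbian trace: strengthen synapses driving the taken
        # action in proportion to the reward just received.
        post = np.zeros(len(probs))
        post[action] = 1.0
        self.W_fast += self.eta * reward * np.outer(post - probs, obs)
```

Under these assumptions, the fast weights accumulate action-reward feedback within a new task, so the behavior adapts online without any gradient steps at test time; only the slow weights and plasticity gains would be tuned during meta-training.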