What deep reinforcement learning tells us about human motor learning and vice-versa

Machine learning, and specifically reinforcement learning (RL), has been extremely successful in helping us to understand neural decision-making processes. However, RL's role in understanding other neural processes, especially motor learning, is much less well explored. To explore this connection, we investigated how recent deep RL methods correspond to the dominant motor learning framework in neuroscience, error-based learning. Error-based learning can be probed using a mirror reversal adaptation paradigm, where it produces distinctive qualitative predictions that are observed in humans. We therefore tested three major families of modern deep RL algorithms on a mirror reversal perturbation. Surprisingly, all of the algorithms failed to mimic human behaviour and indeed displayed qualitatively different behaviour from that predicted by error-based learning. To fill this gap, we introduce a novel deep RL algorithm: model-based deterministic policy gradients (MB-DPG). MB-DPG draws inspiration from error-based learning by explicitly relying on the observed outcome of actions. We show that MB-DPG captures (human) error-based learning under mirror-reversal and rotational perturbations. Next, we demonstrate that error-based learning in the form of MB-DPG learns faster than canonical model-free algorithms on complex arm-based reaching tasks, while being more robust to (forward) model misspecification than model-based RL. These findings highlight the gap between current deep RL methods and human motor adaptation and offer a route to closing this gap, facilitating future beneficial interaction between the two fields.
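
The abstract does not spell out which qualitative predictions error-based learning makes under mirror reversal. One commonly described prediction is that a learner which corrects its action in proportion to the observed error, while still assuming the unperturbed relationship between hand and cursor, compensates successfully for a rotation but drifts the wrong way under a mirror reversal. The toy sketch below illustrates that idea in one dimension. It is not MB-DPG and is not taken from the paper; the function name `simulate`, the learning rate, and the perturbation sizes are all hypothetical choices made for illustration.

```python
# Toy 1D illustration (an assumption, not the paper's method): a learner
# updates a correction `u` in proportion to the observed cursor error,
# assuming d(cursor)/d(hand) = +1 as in unperturbed reaching.

def simulate(perturb, target, lr=0.2, trials=30):
    """Return the absolute cursor error on each trial."""
    u = 0.0                      # learned correction added to the aim direction
    errors = []
    for _ in range(trials):
        hand = target + u        # where the hand actually goes (degrees)
        cursor = perturb(hand)   # where the cursor appears after perturbation
        error = cursor - target  # observed outcome error
        errors.append(abs(error))
        u -= lr * error          # error-based update with assumed sensitivity +1
    return errors

def rotation(hand):
    # Hypothetical 30-degree visuomotor rotation.
    return hand + 30.0

def mirror(hand):
    # Mirror reversal about an axis at 0 degrees.
    return -hand

rot_errors = simulate(rotation, target=10.0)
mir_errors = simulate(mirror, target=10.0)

print("rotation: first %.1f, last %.3f" % (rot_errors[0], rot_errors[-1]))
print("mirror:   first %.1f, last %.1f" % (mir_errors[0], mir_errors[-1]))
```

Under the rotation the error shrinks on every trial, because the assumed sensitivity has the correct sign. Under the mirror reversal the true sensitivity is negative, so each correction pushes the cursor further from the target and the error grows.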
