Animal behavior is driven by multiple brain regions working in parallel with distinct control policies. We present a biologically plausible model of off-policy reinforcement learning in the basal ganglia, which enables learning in such an architecture. The model accounts for action-related modulation of dopamine activity that is not captured by previous models that implement on-policy algorithms. In particular, the model predicts that dopamine activity signals a combination of reward prediction error (as in classic models) and "action surprise," a measure of how unexpected an action is relative to the basal ganglia's current policy. In the presence of the action surprise term, the model implements an approximate form of Q-learning. On benchmark navigation and reaching tasks, we show empirically that this model is capable of learning from data driven completely or in part by other policies (e.g. from other brain regions). By contrast, models without the action surprise term suffer in the presence of additional policies, and are incapable of learning at all from behavior that is completely externally driven. The model provides a computational account for numerous experimental findings about dopamine activity that cannot be explained by classic models of reinforcement learning in the basal ganglia. These include differing levels of action surprise signals in dorsal and ventral striatum, decreasing amounts of movement-modulated dopamine activity with practice, and representations of action initiation and kinematics in dopamine activity. It also provides further predictions that can be tested with recordings of striatal dopamine activity.
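To make the central claim concrete, the sketch below illustrates a dopamine signal that combines a standard temporal-difference reward prediction error with an action surprise term. This is a minimal illustrative assumption, not the paper's exact formulation: the weighting `beta` and the negative-log-probability form of the surprise are placeholders.

```python
import numpy as np

def dopamine_signal(r, v_s, v_s_next, pi_bg, action, gamma=0.99, beta=1.0):
    """Illustrative dopamine signal (hypothetical form, not the paper's equations).

    r         : reward received on this step
    v_s       : value estimate of the current state
    v_s_next  : value estimate of the next state
    pi_bg     : action probabilities under the basal ganglia's current policy
    action    : index of the action actually executed (possibly driven by
                another controller, e.g. a different brain region)
    """
    # Classic reward prediction error, as in standard TD models.
    rpe = r + gamma * v_s_next - v_s
    # Action surprise: large when the executed action is unlikely under
    # the basal ganglia's own policy (i.e. externally driven behavior).
    action_surprise = -np.log(pi_bg[action])
    return rpe + beta * action_surprise
```

In this sketch, the surprise term vanishes as the basal ganglia policy comes to predict the executed actions, consistent with the abstract's claim that movement-modulated dopamine activity decreases with practice.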