One stream of reinforcement learning research explores biologically plausible models and algorithms, both to simulate biological intelligence and to suit neuromorphic hardware. Among these, reward-modulated spike-timing-dependent plasticity (R-STDP) is a recent branch with strong potential for energy efficiency. However, current R-STDP methods rely on heuristically designed local learning rules and therefore require task-specific expert knowledge. In this paper, we consider a spiking recurrent winner-take-all network and propose a new R-STDP method, spiking variational policy gradient (SVPG), whose local learning rules are derived from the global policy gradient, eliminating the need for heuristic design. In experiments on MNIST classification and the Gym InvertedPendulum task, SVPG achieves good training performance and exhibits better robustness to various kinds of noise than conventional methods.
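To make the "local rules derived from a global policy gradient" idea concrete, the sketch below shows one generic way a reward-modulated, three-factor update can arise from a REINFORCE-style gradient for a layer of stochastic spiking neurons. The neuron model (sigmoid firing probability), eligibility-trace dynamics, and all hyperparameter names are illustrative assumptions for exposition only; this is not the paper's SVPG implementation.

```python
import numpy as np

# Hypothetical three-factor update for one stochastic spiking layer:
# the local eligibility trace accumulates d log p(post | pre) / dW,
# and a global reward signal modulates the final weight change.
rng = np.random.default_rng(0)

n_in, n_out = 10, 4
W = 0.1 * rng.standard_normal((n_out, n_in))
elig = np.zeros_like(W)        # eligibility trace (local, per-synapse)
tau_e, lr = 20.0, 1e-2         # illustrative trace time constant and learning rate

def step(pre_spikes, reward, dt=1.0):
    """One time step: sample post spikes, update the trace, apply the reward."""
    global W, elig
    p_fire = 1.0 / (1.0 + np.exp(-(W @ pre_spikes)))        # firing probability
    post_spikes = (rng.random(n_out) < p_fire).astype(float)
    # REINFORCE term: gradient of the spike log-probability w.r.t. W,
    # computable from pre- and post-synaptic quantities only (hence "local").
    grad_logp = np.outer(post_spikes - p_fire, pre_spikes)
    elig += (-elig / tau_e + grad_logp) * dt                 # low-pass filtered trace
    W += lr * reward * elig                                  # global reward gates the local trace
    return post_spikes
```

The point of the sketch is the structure of the rule: every quantity in the weight update is either locally available at the synapse or a single broadcast reward signal, which is the property that makes policy-gradient-derived rules compatible with R-STDP-style learning.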