We study a class of reinforcement learning problems where the reward signals for policy learning are generated by a discriminator that depends on and is jointly optimized with the policy. This interdependence between the policy and the discriminator leads to an unstable learning process: reward signals from an immature discriminator are noisy and impede policy learning, while, conversely, an undertrained policy impedes discriminator learning. We call this learning setting $\textit{Internally Rewarded Reinforcement Learning}$ (IRRL), as the reward is not provided directly by the environment but $\textit{internally}$ by the discriminator. In this paper, we formally formulate IRRL and present a class of problems that belong to it. We theoretically derive and empirically analyze the effect of the reward function in IRRL and, based on these analyses, propose the clipped linear reward function. Experimental results show that the proposed reward function consistently stabilizes the training process by reducing the impact of reward noise, leading to faster convergence and higher performance compared with baselines across diverse tasks.
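As a rough illustration of where such a reward enters the loop, the sketch below shows a policy rewarded by the confidence of a jointly trained discriminator, with the reward passed through a clipped linear function. All names, the toy update rules, and the clipping bounds are assumptions made for this sketch; they are not the paper's actual models or its exact reward definition.

```python
# Minimal sketch of an internally rewarded RL loop (hypothetical toy models).
import numpy as np

def clipped_linear_reward(p_correct, lower=0.0, upper=1.0):
    # Reward grows linearly with discriminator confidence, clipped so that
    # noisy or over-confident discriminator outputs have a bounded effect.
    return float(np.clip(p_correct, lower, upper))

rng = np.random.default_rng(0)
policy_param = 0.0  # stands in for the policy: controls how informative observations are
disc_param = 0.0    # stands in for the discriminator: controls its accuracy

for step in range(1000):
    # Policy collects an observation whose informativeness depends on its parameter.
    informativeness = 1.0 / (1.0 + np.exp(-policy_param))
    # Discriminator confidence in the correct label depends on both modules.
    p_correct = informativeness * (1.0 / (1.0 + np.exp(-disc_param)))
    # Internally generated, noisy reward signal for the policy.
    reward = clipped_linear_reward(p_correct + 0.1 * rng.normal())
    # Crude joint updates standing in for the RL and supervised objectives.
    policy_param += 0.05 * reward
    disc_param += 0.05 * informativeness

print(f"final discriminator confidence ~ {p_correct:.3f}")
```

The clipping step is the point of the sketch: because the policy's learning signal comes entirely from the discriminator, bounding that signal limits how much an immature discriminator can destabilize policy updates.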