Recent studies have shown the vulnerability of reinforcement learning (RL) models in noisy settings. The sources of noise differ across scenarios. For instance, in practice the observed reward channel is often subject to noise (e.g., when observed rewards are collected through sensors), so the observed rewards may not be credible. Likewise, in applications such as robotics, a deep reinforcement learning (DRL) algorithm can be manipulated to produce arbitrary errors. In this paper, we consider noisy RL problems in which the rewards observed by RL agents are generated through a reward confusion matrix; we call these observed rewards perturbed rewards. We develop a robust RL framework, aided by an unbiased reward estimator, that enables RL agents to learn in noisy environments while observing only perturbed rewards. Our framework draws upon approaches for supervised learning with noisy data. The core ideas of our solution are estimating the reward confusion matrix and defining a set of unbiased surrogate rewards. We prove the convergence and sample complexity of our approach. Extensive experiments on different DRL platforms show that policies trained with our estimated surrogate rewards achieve higher expected rewards and converge faster than existing baselines. For instance, the state-of-the-art PPO algorithm obtains average improvements of 67.5% and 46.7% on five Atari games when the error rates are 10% and 30%, respectively.
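To make the surrogate-reward construction concrete, the following is a minimal sketch (our illustration, not the paper's code). It assumes a finite set of reward values and a known or estimated confusion matrix C with C[i, j] = P(observed reward = R_j | true reward = R_i), and solves C r_hat = R so that the surrogate reward equals the true reward in expectation; the function name surrogate_rewards is hypothetical.

```python
import numpy as np

def surrogate_rewards(true_reward_values, confusion_matrix):
    """Illustrative sketch: solve C @ r_hat = R so that
    E[r_hat(observed reward) | true reward = R_i] = R_i,
    i.e. the surrogate reward is an unbiased estimate of the true reward."""
    C = np.asarray(confusion_matrix, dtype=float)
    R = np.asarray(true_reward_values, dtype=float)
    # Requires C to be invertible (e.g., per-state error rates not too large).
    return np.linalg.solve(C, R)

# Binary example with reward values {-1, +1} and 10% symmetric flipping noise:
R = [-1.0, 1.0]
C = [[0.9, 0.1],
     [0.1, 0.9]]
r_hat = surrogate_rewards(R, C)  # -> [-1.25, 1.25]
# Sanity check of unbiasedness when the true reward is +1:
# 0.1 * (-1.25) + 0.9 * 1.25 = 1.0, the true reward value.
```

Under this assumed setup, an agent that replaces each observed (perturbed) reward with the corresponding entry of r_hat is trained, in expectation, on the true reward signal even though it never observes it directly.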