In real scenarios, the state observations that an agent receives may contain measurement errors or adversarial noise, misleading the agent into taking suboptimal actions or even collapsing during training. In this paper, we study the training robustness of distributional Reinforcement Learning~(RL), a class of state-of-the-art methods that estimate the whole distribution, rather than only the expectation, of the total return. First, we validate the contraction of both expectation-based and distributional Bellman operators in the State-Noisy Markov Decision Process~(SN-MDP), a typical tabular case that incorporates both random and adversarial state observation noise. Beyond the SN-MDP, we then analyze the vulnerability of the least squares loss in expectation-based RL with either linear or nonlinear function approximation. By contrast, we theoretically characterize the bounded gradient norm of the distributional RL loss based on histogram density estimation. The resulting stable gradients during optimization account for the better training robustness of distributional RL against state observation noise. Finally, extensive experiments on a suite of games verify the convergence of both expectation-based and distributional RL in the SN-MDP-like setting under different strengths of state observation noise. More importantly, in noisy settings beyond the SN-MDP, distributional RL is less vulnerable to noisy state observations than its expectation-based counterpart.