As a framework for sequential decision-making, Reinforcement Learning (RL) has been regarded as an essential component on the path to Artificial General Intelligence (AGI). However, RL is often criticized for assuming that the training environment is identical to the test environment, which hinders its application in the real world. To mitigate this problem, Distributionally Robust RL (DRRL) has been proposed to improve the worst-case performance over a set of environments that may contain the unknown test environment. Due to the nonlinearity of the robustness objective, most previous works resort to model-based approaches, learning either from an empirical distribution estimated from data or from a simulator that can be sampled infinitely often, which restricts their applicability to environments with simple dynamics. In contrast, we design a DRRL algorithm that can be trained along a single trajectory, i.e., without repeated sampling from the same state. Building on standard Q-learning, we propose distributionally robust Q-learning with a single trajectory (DRQ) and its average-reward variant, differential DRQ. We provide asymptotic convergence guarantees and experiments for both settings, demonstrating their superiority over non-robust baselines in perturbed environments.
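To make the single-trajectory idea concrete, the following is a minimal sketch of a tabular, distributionally robust Q-learning update. It assumes a KL-divergence uncertainty set and, for illustration, a fixed dual variable `beta` in the KL dual form (the actual DRQ algorithm is more involved, e.g. optimizing the dual variable jointly); all names (`drq_step`, `n_states`, `delta`, the toy dynamics) are illustrative and not from the paper.

```python
import numpy as np

def drq_step(Q, Z, s, a, r, s_next, alpha=0.1, gamma=0.9,
             beta=1.0, delta=0.1):
    """One stochastic-approximation step along a single trajectory.

    Z[s, a] tracks an online estimate of E[exp(-max_a' Q(s', a') / beta)],
    the inner expectation in the KL dual; the robust target then uses
    -beta * log Z - beta * delta as the worst-case next-state value.
    """
    v_next = Q[s_next].max()
    # Update the exponential-moment estimate from the single observed s_next.
    Z[s, a] += alpha * (np.exp(-v_next / beta) - Z[s, a])
    robust_v = -beta * np.log(max(Z[s, a], 1e-12)) - beta * delta
    # Standard Q-learning step, but toward the robust target.
    Q[s, a] += alpha * (r + gamma * robust_v - Q[s, a])
    return Q, Z

# Usage on a toy 2-state, 2-action chain with random dynamics:
rng = np.random.default_rng(0)
n_states, n_actions = 2, 2
Q = np.zeros((n_states, n_actions))
Z = np.ones((n_states, n_actions))
s = 0
for _ in range(1000):
    a = int(rng.integers(n_actions))
    s_next = int(rng.integers(n_states))   # toy random transition
    r = 1.0 if (s, a) == (0, 0) else 0.0   # reward only for (s=0, a=0)
    Q, Z = drq_step(Q, Z, s, a, r, s_next)
    s = s_next
```

Note that every quantity is updated from the one transition actually observed, with no resampling from a fixed state, which is exactly the constraint the single-trajectory setting imposes.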