Recently, improving the robustness of policies across different environments has attracted increasing attention in the reinforcement learning (RL) community. Existing robust RL methods mostly aim to achieve max-min robustness by optimizing the policy's performance in the worst-case environment. However, in practice, a user of an RL policy may have different preferences over its performance across environments. Clearly, the aforementioned max-min robustness is often too conservative to satisfy user preferences. Therefore, in this paper, we integrate user preference into policy learning in robust RL and propose a novel User-Oriented Robust RL (UOR-RL) framework. Specifically, we define a new User-Oriented Robustness (UOR) metric for RL, which allocates different weights to the environments according to user preference and generalizes the max-min robustness metric. To optimize the UOR metric, we develop two UOR-RL training algorithms for the scenarios with and without an a priori known environment distribution, respectively. Theoretically, we prove that our UOR-RL training algorithms converge to near-optimal policies even with inaccurate or no knowledge of the environment distribution. Furthermore, we carry out extensive experimental evaluations on 4 MuJoCo tasks. The experimental results demonstrate that UOR-RL is comparable to the state-of-the-art baselines under the average and worst-case performance metrics, and, more importantly, establishes new state-of-the-art performance under the UOR metric.
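As a minimal sketch of the idea (the notation below is assumed for exposition and is not the paper's formal definition): let $J(\pi, e)$ denote the expected return of policy $\pi$ in environment $e \in \mathcal{E}$, and let $w(e) \geq 0$ with $\sum_{e \in \mathcal{E}} w(e) = 1$ be preference weights supplied by the user. A weighted robustness objective of this flavor reads
\[
\mathrm{UOR}(\pi) \;=\; \sum_{e \in \mathcal{E}} w(e)\, J(\pi, e),
\]
which recovers the classical max-min objective $\max_{\pi} \min_{e \in \mathcal{E}} J(\pi, e)$ as the special case where, for each policy, all weight is placed on its worst-case environment.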