Robust reinforcement learning (RL) considers the problem of learning policies that perform well in the worst case among a set of possible environment parameter values. In real-world environments, choosing the set of possible values for robust RL can be a difficult task. When that set is specified too narrowly, the agent will be left vulnerable to reasonable parameter values unaccounted for. When specified too broadly, the agent will be too cautious. In this paper, we propose Feasible Adversarial Robust RL (FARR), a novel problem formulation and objective for automatically determining the set of environment parameter values over which to be robust. FARR implicitly defines the set of feasible parameter values as those on which an agent could achieve a benchmark reward given enough training resources. By formulating this problem as a two-player zero-sum game, optimizing the FARR objective jointly produces an adversarial distribution over parameter values with feasible support and a policy robust over this feasible parameter set. We demonstrate that approximate Nash equilibria for this objective can be found using a variation of the PSRO algorithm. Furthermore, we show that an optimal agent trained with FARR is more robust to feasible adversarial parameter selection than with existing minimax, domain-randomization, and regret objectives in a parameterized gridworld and three MuJoCo control environments.
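To make the formulation concrete, here is a minimal sketch of the objective in notation we introduce purely for illustration (the paper's own symbols may differ): writing $J(\pi, \theta)$ for the expected return of policy $\pi$ under environment parameters $\theta$ and $\rho$ for the benchmark reward, the feasible set described above can be read as $\Theta_{\mathrm{feas}} = \{\theta \in \Theta : \max_{\pi'} J(\pi', \theta) \ge \rho\}$, i.e., the parameter values on which some sufficiently trained policy attains the benchmark. The FARR objective is then the two-player zero-sum game $\max_{\pi} \min_{\theta \in \Theta_{\mathrm{feas}}} J(\pi, \theta)$, whose approximate Nash equilibrium yields both the adversarial distribution over feasible parameter values and the policy robust over that feasible set.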