To apply reinforcement learning (RL) to real-world applications, agents must adhere to the safety guidelines of their respective domains. Safe RL can handle such guidelines effectively by converting them into constraints on the RL problem. In this paper, we develop a safe distributional RL method based on the trust region method, which can satisfy constraints consistently. However, policies may still violate the safety guidelines due to the estimation bias of distributional critics, and the importance sampling required by the trust region method can degrade performance due to its high variance. Hence, we enhance safety performance through the following approaches. First, we train distributional critics to have low estimation biases using proposed target distributions in which bias and variance can be traded off. Second, we propose novel surrogates for the trust region method that are expressed with Q-functions using the reparameterization trick. Additionally, depending on the initial policy, there may be no policy within the trust region that satisfies the constraints. To handle this infeasibility issue, we propose a gradient integration method that is guaranteed to find a policy satisfying all constraints starting from an unsafe initial policy. In extensive experiments, the proposed method with risk-averse constraints shows minimal constraint violations while achieving high returns compared to existing safe RL methods.
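To make the second contribution concrete, the sketch below illustrates the general idea of a reparameterization-trick surrogate that avoids importance sampling: actions are sampled as a differentiable function of the policy parameters and Gaussian noise, so the reward and cost surrogates can be written as expected Q-values whose gradients flow into the policy. This is a minimal, generic sketch rather than the paper's exact formulation; the names GaussianPolicy, surrogate_objectives, reward_q, and cost_q, as well as the network sizes, are illustrative assumptions.

```python
# Minimal sketch (assumed, not the paper's exact surrogate): reparameterized
# action sampling lets Q-function gradients reach the policy parameters
# without importance-sampling ratios.
import torch
import torch.nn as nn


class GaussianPolicy(nn.Module):
    """Tanh-squashed Gaussian policy; layer sizes are illustrative."""

    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean = nn.Linear(hidden, act_dim)
        self.log_std = nn.Linear(hidden, act_dim)

    def rsample(self, obs):
        # a = tanh(mean(s) + std(s) * eps), eps ~ N(0, I): the
        # reparameterization trick keeps the sample differentiable.
        h = self.net(obs)
        mean = self.mean(h)
        log_std = self.log_std(h).clamp(-5.0, 2.0)
        eps = torch.randn_like(mean)
        return torch.tanh(mean + log_std.exp() * eps)


def surrogate_objectives(policy, reward_q, cost_q, obs):
    """Reward and cost surrogates as expected Q-values under the current policy.

    reward_q and cost_q are assumed critic callables mapping (obs, act) to a
    scalar estimate per state; obs is a batch of observations.
    """
    act = policy.rsample(obs)
    return reward_q(obs, act).mean(), cost_q(obs, act).mean()
```

In a trust-region update, surrogates of this form would be maximized (for reward) and bounded (for cost) subject to a divergence constraint on the policy; the specific constraint handling and the distributional critics are described in the body of the paper.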