Deep reinforcement learning (DRL) requires collecting large amounts of interventional data, which can be expensive or even unethical in real-world settings such as autonomous driving and medicine. Offline reinforcement learning promises to alleviate this issue by exploiting the vast amount of observational data available in the real world. However, observational data may mislead the learning agent toward undesirable outcomes if the behavior policy that generated the data depends on unobserved random variables (i.e., confounders). In this paper, we propose two deconfounding methods for DRL to address this problem. The methods first estimate the importance of each sample using causal inference techniques, and then adjust each sample's contribution to the loss function by reweighting or resampling the offline dataset to ensure unbiasedness. These deconfounding methods can be flexibly combined with existing model-free DRL algorithms such as soft actor-critic and deep Q-learning, provided that the loss functions of these algorithms satisfy a weak condition. We prove the effectiveness of our deconfounding methods and validate them experimentally.
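To make the reweighting and resampling ideas concrete, the sketch below shows how per-sample importance weights could be applied to a standard model-free loss. This is an illustrative example only, not the paper's method: the weights are assumed to be given (in the paper they would come from the causal-inference-based estimate), and all names (q_net, target_net, weights, batch fields) are hypothetical.

```python
import torch


def weighted_td_loss(q_net, target_net, batch, weights, gamma=0.99):
    """Deep Q-learning TD loss with per-sample reweighting.

    `weights` is a 1-D tensor of importance weights, assumed to be
    precomputed by some deconfounding procedure (hypothetical here).
    """
    states, actions, rewards, next_states, dones = batch
    # Q-values of the actions actually taken in the offline dataset.
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q
    per_sample = (q_values - targets) ** 2
    # Reweighting: each sample contributes in proportion to its weight.
    return (weights * per_sample).mean()


def resample_indices(weights, batch_size):
    """Resampling alternative: draw minibatch indices with probability
    proportional to the importance weights, then use an unweighted loss."""
    probs = weights / weights.sum()
    return torch.multinomial(probs, batch_size, replacement=True)
```

Either variant plugs into an existing model-free training loop: the reweighted loss replaces the usual mean-squared TD loss, while the resampling variant only changes how minibatches are drawn from the offline dataset.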