Offline reinforcement learning (RL) has received rising interest due to its appealing data efficiency. The present study addresses behavior estimation, a task that lays the foundation of many offline RL algorithms. Behavior estimation aims at estimating the policy with which training data are generated. In particular, this work considers a scenario where the data are collected from multiple sources. In this case, by neglecting data heterogeneity, existing approaches for behavior estimation suffer from behavior misspecification. To overcome this drawback, the present study proposes a latent variable model to infer a set of policies from data, which allows an agent to use as behavior policy the policy that best describes a particular trajectory. This model provides an agent with a fine-grained characterization of multi-source data and helps it overcome behavior misspecification. This work also proposes a learning algorithm for this model and illustrates its practical use by extending an existing offline RL algorithm. Lastly, extensive evaluation confirms the existence of behavior misspecification and the efficacy of the proposed model.
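As a hedged sketch of this idea (the symbols below, including the latent index z, the mixture weights w_k, and the component policies \pi_k, are illustrative and not taken from the paper), multi-source data can be viewed as generated by a mixture over K latent behavior policies, and each trajectory is attributed to the component that explains it best:

\[
p(\tau) \;=\; \sum_{k=1}^{K} w_k \,\rho(s_0) \prod_{t=0}^{T-1} \pi_k(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t),
\qquad
\hat{k}(\tau) \;=\; \operatorname*{arg\,max}_{k}\; p(z = k \mid \tau),
\]

so that the behavior policy associated with trajectory \tau is \pi_{\hat{k}(\tau)}, rather than a single policy fit to the pooled data as in standard behavior estimation.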