Offline reinforcement learning (RL) has received rising interest due to its appealing data efficiency. The present study addresses behavior estimation, a task that lays the foundation for many offline RL algorithms. Behavior estimation aims at estimating the policy with which the training data were generated. In particular, this work considers a scenario where the data are collected from multiple sources. In this case, by neglecting data heterogeneity, existing approaches to behavior estimation suffer from behavior misspecification. To overcome this drawback, the present study proposes a latent variable model to infer a set of policies from data, which allows an agent to use as its behavior policy the policy that best describes a particular trajectory. This model provides an agent with a fine-grained characterization of multi-source data and helps it overcome behavior misspecification. This work also proposes a learning algorithm for this model and illustrates its practical usage by extending an existing offline RL algorithm. Lastly, through extensive evaluation this work confirms the existence of behavior misspecification and the efficacy of the proposed model.
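To make the idea of a latent variable over behavior policies concrete, the following is a minimal sketch (not the authors' exact method) that treats each trajectory as generated by one of K unknown tabular policies and fits them with an EM-style procedure: the E-step infers which latent policy is responsible for each trajectory, and the M-step performs responsibility-weighted behavior cloning. All names and parameters (K, n_states, n_actions, the trajectory format) are illustrative assumptions.

```python
import numpy as np

def em_behavior_estimation(trajectories, n_states, n_actions, K=3, n_iters=50, seed=0):
    """trajectories: list of trajectories, each a list of (state, action) pairs."""
    rng = np.random.default_rng(seed)
    # policies[k, s, a] = P(a | s) under latent behavior policy k
    policies = rng.dirichlet(np.ones(n_actions), size=(K, n_states))
    mix = np.full(K, 1.0 / K)  # prior over latent sources

    for _ in range(n_iters):
        # E-step: posterior responsibility of each latent policy for each trajectory
        log_resp = np.zeros((len(trajectories), K))
        for i, traj in enumerate(trajectories):
            states = np.array([s for s, a in traj])
            actions = np.array([a for s, a in traj])
            for k in range(K):
                log_resp[i, k] = np.log(mix[k]) + np.sum(
                    np.log(policies[k, states, actions] + 1e-12))
        log_resp -= log_resp.max(axis=1, keepdims=True)
        resp = np.exp(log_resp)
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: responsibility-weighted behavior cloning for each latent policy
        counts = np.full((K, n_states, n_actions), 1e-3)  # small smoothing prior
        for i, traj in enumerate(trajectories):
            for s, a in traj:
                counts[:, s, a] += resp[i]
        policies = counts / counts.sum(axis=2, keepdims=True)
        mix = resp.mean(axis=0)

    # resp[i] indicates which inferred policy best describes trajectory i,
    # which is the quantity an offline RL agent would use as its behavior policy.
    return policies, mix, resp
```

This sketch only illustrates the general mixture-of-policies formulation described in the abstract; the paper's actual model and learning algorithm may differ in parameterization and inference.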