We consider an offline reinforcement learning (RL) setting where the agent needs to learn from a dataset collected by rolling out multiple behavior policies. This setting poses two challenges: 1) The optimal trade-off between optimizing the RL signal and the behavior cloning (BC) signal varies across states, because the action coverage induced by the different behavior policies varies. Previous methods fail to handle this because they only control a global trade-off. 2) For a given state, the action distribution generated by different behavior policies may have multiple modes. The BC regularizers in many previous methods are mean-seeking, resulting in policies that select out-of-distribution (OOD) actions lying between the modes. In this paper, we address both challenges by using an adaptively weighted reverse Kullback-Leibler (KL) divergence as the BC regularizer on top of the TD3 algorithm. Our method not only trades off the RL and BC signals with per-state weights (i.e., strong BC regularization on states with narrow action coverage, and weak regularization on states with broad coverage) but also avoids selecting OOD actions thanks to the mode-seeking property of the reverse KL. Empirically, our algorithm outperforms existing offline RL algorithms on the MuJoCo locomotion tasks with the standard D4RL datasets as well as mixed datasets that combine the standard datasets.
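For concreteness, a minimal sketch of the kind of objective the abstract describes, assuming a state-dependent weight $w(s)$ and an estimate of the mixture behavior policy $\pi_\beta$ (both are illustrative notation, not necessarily the paper's exact formulation):

```latex
% Per-state weighted, reverse-KL-regularized policy objective (sketch).
% w(s): assumed state-dependent weight, larger where the dataset's action
%       coverage at s is narrow; \pi_\beta: the (mixture) behavior policy.
\max_{\pi} \; \mathbb{E}_{s \sim \mathcal{D}}
\Big[ Q(s, \pi(s))
  \; - \; w(s)\, D_{\mathrm{KL}}\!\big(\pi(\cdot \mid s) \,\|\, \pi_\beta(\cdot \mid s)\big) \Big]
```

Because the learned policy $\pi$ appears as the first argument of the KL, the regularizer is mode-seeking: $\pi$ is pushed onto one mode of $\pi_\beta(\cdot \mid s)$ rather than toward the mean of several modes, which is what allows it to avoid OOD actions between modes.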