We study a new paradigm for sequential decision making, called offline Policy Learning from Observation (PLfO). Offline PLfO aims to learn policies from datasets of substandard quality: 1) only a subset of trajectories is labeled with rewards, 2) labeled trajectories may not contain actions, 3) labeled trajectories may not be of high quality, and 4) the overall data may not have full coverage. Such imperfections are common in real-world learning scenarios, so offline PLfO encompasses many existing offline learning setups, including offline imitation learning (IL), offline imitation learning from observations (ILfO), and offline reinforcement learning (RL). In this work, we present a generic approach to offline PLfO, called Modality-agnostic Adversarial Hypothesis Adaptation for Learning from Observations (MAHALO). Built on the concept of pessimism from offline RL, MAHALO optimizes the policy against a performance lower bound that accounts for uncertainty due to the dataset's insufficient coverage. We implement this idea by adversarially training data-consistent critic and reward functions during policy optimization, which forces the learned policy to be robust to the data's deficiencies. We show, in theory and experiments, that MAHALO consistently outperforms or matches specialized algorithms across a variety of offline PLfO tasks.
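To make the adversarial lower-bound idea concrete, one schematic way to state MAHALO's policy objective is as a max-min problem over data-consistent hypotheses; the notation below does not appear in the abstract and is our own illustrative assumption, not the paper's exact formulation:
\[
\hat{\pi} \in \operatorname*{arg\,max}_{\pi \in \Pi} \; \min_{r \in \mathcal{R}_{\mathcal{D}},\; f \in \mathcal{F}_{\mathcal{D}}} \; \widehat{J}_{\mathcal{D}}(\pi; r, f),
\]
where $\mathcal{R}_{\mathcal{D}}$ and $\mathcal{F}_{\mathcal{D}}$ denote reward and critic hypothesis classes restricted to be consistent with the (partially labeled, possibly action-free) dataset $\mathcal{D}$, and $\widehat{J}_{\mathcal{D}}(\pi; r, f)$ is a data-based lower-bound estimate of the policy's performance. The inner minimization supplies the pessimism: the policy is evaluated under the worst reward and critic that the data cannot rule out, which encourages robustness to missing rewards, missing actions, and limited coverage.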