MAHALO: Unifying Offline Reinforcement Learning and Imitation Learning from Observations

We study a new paradigm for sequential decision making, called offline Policy Learning from Observation (PLfO). Offline PLfO aims to learn policies from imperfect datasets: 1) only a subset of trajectories is labeled with rewards, 2) labeled trajectories may not contain actions, 3) labeled trajectories may not be of high quality, and 4) the overall data may not have full coverage. Such imperfections are common in real-world learning scenarios, and offline PLfO encompasses many existing offline learning setups, including offline imitation learning (IL), imitation learning from observations (ILfO), and reinforcement learning (RL). In this work, we present a generic approach for offline PLfO, called Modality-agnostic Adversarial Hypothesis Adaptation for Learning from Observations (MAHALO). Built upon the pessimism concept in offline RL, MAHALO optimizes the policy using a performance lower bound that accounts for uncertainty due to the dataset's insufficient coverage. We implement this idea by adversarially training data-consistent critic and reward functions in policy optimization, which forces the learned policy to be robust to the data deficiency. We show that MAHALO consistently outperforms or matches specialized algorithms across a variety of offline PLfO tasks in theory and experiments.
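The pessimism idea mentioned above can be illustrated with a deliberately tiny one-step example. This is only a sketch under strong assumptions, not the MAHALO algorithm: there is a single decision, rewards are assumed to lie in [0, 1], and the names `reward_bounds` and `pessimistic_action` are hypothetical. Each action gets an interval of rewards consistent with the labeled data, and the policy picks the action with the best lower bound, so actions the data does not cover are never preferred on the strength of an optimistic guess.

```python
from typing import Dict, List, Optional, Tuple

# One sample is (action, reward); reward is None when the sample is unlabeled.
Sample = Tuple[str, Optional[float]]


def reward_bounds(
    data: List[Sample],
    actions: List[str],
    r_min: float = 0.0,  # assumed lower limit on any reward
    r_max: float = 1.0,  # assumed upper limit on any reward
) -> Dict[str, Tuple[float, float]]:
    """Return, per action, the interval of rewards consistent with the data."""
    bounds = {}
    for a in actions:
        labels = [r for (act, r) in data if act == a and r is not None]
        if labels:
            # Labeled actions: the hypothesis must match the observed mean.
            mean = sum(labels) / len(labels)
            bounds[a] = (mean, mean)
        else:
            # Unlabeled or unseen actions: any reward in range is consistent.
            bounds[a] = (r_min, r_max)
    return bounds


def pessimistic_action(data: List[Sample], actions: List[str]) -> str:
    """Pick the action whose worst-case consistent reward is highest."""
    bounds = reward_bounds(data, actions)
    return max(actions, key=lambda a: bounds[a][0])


if __name__ == "__main__":
    data = [("left", 0.6), ("left", 0.4), ("right", 0.3), ("up", None)]
    actions = ["left", "right", "up", "down"]
    print(pessimistic_action(data, actions))
```

In this toy data, "up" appears without a reward label and "down" never appears, so both have a worst-case reward of 0 and the pessimistic choice falls on a labeled action. The setting in the abstract is sequential and uses learned critic and reward functions, which this sketch does not attempt to model.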

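Related content

Imitation learning

Imitation learning is a family of tasks in which a learner tries to imitate expert behavior in order to achieve the best possible performance. Mainstream methods currently include supervised imitation learning, stochastic mixing iterative learning, and dataset aggregation. The principle behind imitation learning is to implicitly give the learner prior information about the world, for example by having it execute and learn human behavior. In an imitation learning task, the agent wants to learn a policy that performs a behavior as much like a human expert as possible, so it looks for the best way to use the training set (input-output pairs) demonstrated by that expert. When an agent learns human behavior, imitation learning is still needed, but simulating the behavior in real time is very costly. In contrast, apprenticeship learning, proposed by Andrew Ng, follows a purely greedy (exploitative) policy and uses reinforcement learning to traverse all (state and action) trajectories in order to learn a near-optimal policy. It requires extremely difficult maneuvers, and recovering from unobserved states is almost impossible. Imitation learning can handle these unexplored states, so it offers a more reliable general framework for many tasks such as autonomous driving.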
