如何在离线强化学习中利用无标签数据 (How to Leverage Unlabeled Data in Offline Reinforcement Learning)

Offline reinforcement learning (RL) can learn control policies from static datasets but, like standard RL methods, it requires reward annotations for every transition. In many cases, labeling large datasets with rewards may be costly, especially if those rewards must be provided by human labelers, while collecting diverse unlabeled data might be comparatively inexpensive. How can we best leverage such unlabeled data in offline RL? One natural solution is to learn a reward function from the labeled data and use it to label the unlabeled data. In this paper, we find that, perhaps surprisingly, a much simpler method that simply applies zero rewards to unlabeled data leads to effective data sharing both in theory and in practice, without learning any reward model at all. While this approach might seem strange (and incorrect) at first, we provide extensive theoretical and empirical analysis that illustrates how it trades off reward bias, sample complexity and distributional shift, often leading to good results. We characterize conditions under which this simple strategy is effective, and further show that extending it with a simple reweighting approach can further alleviate the bias introduced by using incorrect reward labels. Our empirical evaluation confirms these findings in simulated robotic locomotion, navigation, and manipulation settings.

翻译：离线强化学习( RL) 可以从静态数据集中学习控制政策, 但像标准 RL 方法一样, 它要求每个过渡都得到奖赏说明。在许多情况下, 标出大数据集并标出奖赏, 成本可能很高, 特别是如果这些奖赏必须由人类标签者提供, 同时收集多种未贴标签的数据可能相对便宜。我们如何最好地利用离线的RL 中这种未贴标签的数据? 一个自然的解决办法是从标签数据中学习奖赏功能, 并用它标出未贴标签的数据。本文中, 我们发现, 可能令人惊讶的是, 一个简单简单将零奖赏用于未贴标签的数据的简单简单简单简单方法, 导致在理论和实践上有效分享数据, 而不学习任何奖赏模式。虽然这个方法最初似乎有些奇怪( 和不正确 ), 但我们提供了广泛的理论和经验分析, 来说明它是如何从奖励偏差、抽样复杂和分布变化中交换, 往往导致良好结果的。我们描述了这一简单战略有效的条件, 并且进一步表明, 以简单的重新加权方法扩展它能够进一步减轻错误的偏差变。