Natural agents can effectively learn from multiple data sources that differ in size, quality, and types of measurements. We study this heterogeneity in the context of offline reinforcement learning (RL) by introducing a new, practically motivated semi-supervised setting. Here, an agent has access to two sets of trajectories: labelled trajectories containing (state, action, reward) triplets at every timestep, and unlabelled trajectories that contain only state and reward information. For this setting, we develop and study a simple meta-algorithmic pipeline that learns an inverse dynamics model on the labelled data to obtain proxy labels for the unlabelled data, followed by the use of any offline RL algorithm on the true and proxy-labelled trajectories. Empirically, we find this simple pipeline to be highly successful: on several D4RL benchmarks, certain offline RL algorithms can match the performance of variants trained on a fully labelled dataset even when we label only 10% of the trajectories, drawn from the low-return regime. To strengthen our understanding, we perform a large-scale controlled empirical study investigating the interplay between data-centric properties of the labelled and unlabelled datasets and algorithmic design choices (e.g., the choice of inverse dynamics model and offline RL algorithm) to identify general trends and best practices for training RL agents on semi-supervised offline datasets.
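The pipeline can be summarized in a minimal sketch, shown below. All names, the dataset layout, and the regressor used for the inverse dynamics model are illustrative assumptions, not the paper's actual implementation; the final offline RL training step is left as a plug-in hook for any off-the-shelf algorithm.

```python
# Minimal sketch of the proxy-labelling pipeline (illustrative, not the paper's code).
import numpy as np
from sklearn.neural_network import MLPRegressor


def fit_inverse_dynamics(labelled_trajs):
    """Fit an inverse dynamics model a_t ~ f(s_t, s_{t+1}) on labelled trajectories."""
    X, y = [], []
    for traj in labelled_trajs:
        states, actions = traj["states"], traj["actions"]
        for t in range(len(actions)):
            X.append(np.concatenate([states[t], states[t + 1]]))
            y.append(actions[t])
    # Hypothetical choice of regressor; any supervised model could be swapped in.
    model = MLPRegressor(hidden_layer_sizes=(256, 256), max_iter=200)
    model.fit(np.array(X), np.array(y))
    return model


def proxy_label(unlabelled_trajs, inv_dyn):
    """Fill in the missing actions of unlabelled trajectories with predicted proxy labels."""
    proxy_trajs = []
    for traj in unlabelled_trajs:
        states = traj["states"]
        inputs = np.array([np.concatenate([states[t], states[t + 1]])
                           for t in range(len(states) - 1)])
        actions = inv_dyn.predict(inputs)
        proxy_trajs.append({"states": states,
                            "actions": actions,
                            "rewards": traj["rewards"]})
    return proxy_trajs


# The combined dataset is then handed to any offline RL algorithm (e.g., TD3+BC, CQL):
# inv_dyn = fit_inverse_dynamics(labelled_trajs)
# dataset = labelled_trajs + proxy_label(unlabelled_trajs, inv_dyn)
# offline_rl_train(dataset)  # placeholder hook for the chosen offline RL algorithm
```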