Existing offline reinforcement learning (RL) algorithms typically assume that training data is either (1) generated by a known policy or (2) of entirely unknown origin. We consider multi-demonstrator offline RL, a middle ground in which we know which demonstrator generated each dataset but make no assumptions about the demonstrators' underlying policies. This is the most natural setting when collecting data from multiple human operators, yet it remains unexplored. Since different demonstrators induce different data distributions, we show that this setting can be naturally framed as a domain generalization problem, with each demonstrator corresponding to a different domain. Specifically, we propose Domain-Invariant Model-based Offline RL (DIMORL), in which we apply Risk Extrapolation (REx) (Krueger et al., 2020) to the learning of dynamics and reward models. Our results show that models trained with REx exhibit improved domain generalization performance compared with the natural baseline of pooling all demonstrators' data. We observe that the resulting models frequently enable the learning of superior policies in the offline model-based RL setting, can improve the stability of the policy learning process, and can potentially enable increased exploration.
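To make the core idea concrete, below is a minimal sketch of a V-REx-style training loss applied to dynamics and reward model fitting, with one risk term per demonstrator (domain). This is an illustrative assumption-laden example, not the paper's implementation: the model interface, the `beta` coefficient, and the batch format are hypothetical.

```python
# Minimal sketch (assumptions: a PyTorch model that maps (state, action) to
# (predicted next state, predicted reward); one minibatch per demonstrator).
import torch
import torch.nn.functional as F

def rex_model_loss(model, batches_by_demonstrator, beta=10.0):
    """V-REx objective: mean of per-demonstrator risks plus a variance penalty.

    batches_by_demonstrator: list of (state, action, next_state, reward)
    tensor tuples, one tuple per demonstrator (i.e. per domain).
    """
    risks = []
    for s, a, s_next, r in batches_by_demonstrator:
        pred_next, pred_r = model(s, a)  # hypothetical model interface
        risk = F.mse_loss(pred_next, s_next) + F.mse_loss(pred_r, r)
        risks.append(risk)
    risks = torch.stack(risks)
    # Penalizing the variance of risks across demonstrators pushes the model
    # to fit all demonstrators' data equally well, encouraging domain invariance.
    return risks.mean() + beta * risks.var()
```

In contrast, the pooled baseline simply concatenates all demonstrators' data and minimizes a single average loss, which can let the model trade off accuracy on minority demonstrators for accuracy on the majority.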