We present a novel off-policy loss function for learning a transition model in model-based reinforcement learning. Notably, our loss is derived from the off-policy policy evaluation objective, with an emphasis on correcting distribution shift. Compared to previous model-based techniques, our approach yields greater robustness under model misspecification or under the distribution shift induced by learning or evaluating policies that differ from the data-generating policy. We provide a theoretical analysis and show empirical improvements over existing model-based off-policy evaluation methods. Further analysis shows that our loss can also be used for off-policy optimization (OPO), and we demonstrate its integration with recent improvements in OPO.
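To make the general idea of distribution-shift-corrected model learning concrete, the following is a minimal, generic sketch (not the paper's actual loss): it fits a Gaussian transition model with per-transition importance weights w = pi_eval(a|s) / pi_behavior(a|s), so model accuracy is concentrated where the evaluation policy visits rather than where the data-generating policy does. All names here (`GaussianDynamics`, `weighted_nll`, the weight-normalization choice) are illustrative assumptions, not identifiers or details taken from the paper.

```python
# Generic illustration of importance-weighted transition-model fitting.
# This is NOT the paper's derived loss; it only sketches the broad idea of
# reweighting logged transitions toward the evaluation policy's distribution.

import torch
import torch.nn as nn


class GaussianDynamics(nn.Module):
    """Predicts a diagonal-Gaussian distribution over the next state."""

    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean = nn.Linear(hidden, state_dim)
        self.log_std = nn.Linear(hidden, state_dim)

    def forward(self, s, a):
        h = self.net(torch.cat([s, a], dim=-1))
        return self.mean(h), self.log_std(h).clamp(-5.0, 2.0)


def weighted_nll(model, s, a, s_next, log_w):
    """Importance-weighted negative log-likelihood of logged transitions.

    log_w = log pi_eval(a|s) - log pi_behavior(a|s), computed offline from the
    logged actions; weights are self-normalized over the batch for stability.
    """
    mean, log_std = model(s, a)
    dist = torch.distributions.Normal(mean, log_std.exp())
    nll = -dist.log_prob(s_next).sum(dim=-1)           # per-transition NLL
    w = torch.softmax(log_w, dim=0) * log_w.numel()    # self-normalized weights
    return (w.detach() * nll).mean()
```

In this sketch, the only difference from standard maximum-likelihood model fitting is the weight term; the paper's contribution is a principled choice of loss derived from the off-policy evaluation objective rather than this simple heuristic reweighting.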