Generalizing model-based reinforcement learning (MBRL) methods to environments with unseen transition dynamics is an important yet challenging problem. Existing methods try to extract environment-specific information $Z$ from past transition segments so that the dynamics prediction model can generalize to different dynamics. However, because environments are not labelled, the extracted information inevitably contains redundant information unrelated to the dynamics in the transition segments and thus fails to maintain a crucial property of $Z$: $Z$ should be similar within the same environment and dissimilar across different ones. As a result, the learned dynamics prediction function deviates from the true one, which undermines generalization. To tackle this problem, we introduce an interventional prediction module to estimate the probability that two estimates $\hat{z}_i, \hat{z}_j$ belong to the same environment. Furthermore, by exploiting $Z$'s invariance within a single environment, we propose a relational head to enforce the similarity between estimates $\hat{Z}$ from the same environment, thereby reducing the redundant information in $\hat{Z}$. We empirically show that $\hat{Z}$ estimated by our method contains less redundant information than that of previous methods, and that such $\hat{Z}$ significantly reduces dynamics prediction errors and improves the performance of model-based RL methods on new environments with unseen dynamics in a zero-shot manner. The code is available at \url{https://github.com/CR-Gjx/RIA}.
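To make the mechanism concrete, the following is a minimal, illustrative PyTorch sketch of the relational-head idea described above: a context encoder maps a past transition segment to an estimate $\hat{z}$, a pairwise classifier scores whether two estimates come from the same environment, and a binary cross-entropy loss pulls together estimates from the same environment. The class names, network sizes, and positive-pair construction here are assumptions for illustration only, not the authors' implementation (see the repository linked above).

\begin{verbatim}
# Illustrative sketch only; class names, sizes, and the pairing scheme are
# assumptions, not the authors' implementation (see the linked repository).
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Encodes a past transition segment into an environment estimate z-hat."""
    def __init__(self, seg_dim: int, z_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(seg_dim, 128), nn.ReLU(), nn.Linear(128, z_dim))

    def forward(self, segment: torch.Tensor) -> torch.Tensor:
        return self.net(segment)

class RelationalHead(nn.Module):
    """Scores whether z_i and z_j were extracted from the same environment."""
    def __init__(self, z_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * z_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, z_i: torch.Tensor, z_j: torch.Tensor) -> torch.Tensor:
        # Returns a logit for "same environment".
        return self.net(torch.cat([z_i, z_j], dim=-1))

# Toy usage: segments from the same trajectory form positive pairs (label 1),
# segments from different trajectories form negative pairs (label 0); the
# BCE loss encourages z-hats from the same environment to be similar.
encoder, head = ContextEncoder(seg_dim=32), RelationalHead()
seg_a, seg_b, seg_c = torch.randn(8, 32), torch.randn(8, 32), torch.randn(8, 32)
z_a, z_b, z_c = encoder(seg_a), encoder(seg_b), encoder(seg_c)
logits = torch.cat([head(z_a, z_b), head(z_a, z_c)], dim=0)
labels = torch.cat([torch.ones(8, 1), torch.zeros(8, 1)], dim=0)
loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
loss.backward()
\end{verbatim}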