A long-standing goal of reinforcement learning is for algorithms to learn on training tasks and, like humans, generalize well to unseen tasks, where different tasks share similar dynamics but have different reward functions. A general challenge is that it is nontrivial to quantitatively measure the similarity between these tasks, which is vital for analyzing the task distribution and for designing algorithms with stronger generalization. To address this, we present a novel metric named Task Distribution Relevance (TDR), defined via optimal Q functions, to quantitatively capture the relevance of the task distribution. For tasks with a high TDR, i.e., tasks that differ significantly, we demonstrate that Markovian policies cannot distinguish them and therefore yield poor performance. Based on this observation, we propose the Reward Informed Dreamer (RID) framework with reward-informed world models, which captures invariant latent features over tasks and encodes reward signals into policies to distinguish different tasks. In RID, we derive the corresponding variational lower bound of the data log-likelihood under reward-informed world models, which includes a novel term for distinguishing different tasks via states. Finally, extensive experiments on the DeepMind Control Suite demonstrate that RID significantly improves performance when handling multiple tasks simultaneously, especially those with high TDR, and generalizes effectively to unseen tasks.
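For intuition, here is a minimal sketch of how a TDR-style score between two tasks might be computed. The abstract defines TDR only as a metric built from optimal Q functions, so the function name, the shared state-action evaluation set, and the disagreement measure below are all hypothetical illustrations, not the paper's actual formula:

```python
import numpy as np

def task_distribution_relevance(q_star_a, q_star_b, states, actions):
    """Hypothetical TDR-style score between two tasks.

    Assumes TDR compares the two tasks' optimal Q-values on shared
    (state, action) pairs; the exact definition is not given in the
    abstract. A larger score here means the tasks differ more.
    """
    qa = np.array([q_star_a(s, a) for s, a in zip(states, actions)])
    qb = np.array([q_star_b(s, a) for s, a in zip(states, actions)])
    # Standardize each task's Q-values so the comparison is
    # invariant to reward scale and offset.
    qa = (qa - qa.mean()) / (qa.std() + 1e-8)
    qb = (qb - qb.mean()) / (qb.std() + 1e-8)
    # Mean squared disagreement of the normalized optimal Q-values.
    return float(np.mean((qa - qb) ** 2))
```

Under this reading, a high score indicates that the same state-action pairs are valued very differently across the two tasks, which is the regime where the abstract argues Markovian policies fail to distinguish them.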