Many robotic tasks consist of numerous temporally correlated sub-tasks in highly complex environments. To solve such problems effectively, it is important to discover situational intentions and appropriate actions by deliberating over temporal abstractions. To understand intention separately from changing task dynamics, we extend an empowerment-based regularization technique to multi-task settings within the framework of a generative adversarial network. In multi-task environments with unknown dynamics, we focus on learning a reward function and policy from unlabeled expert examples. In this study, we define situational empowerment as the maximum of the mutual information representing how an action, conditioned on both a state and a sub-task, affects the future. Our proposed method derives a variational lower bound of this situational mutual information and optimizes it. By adding the induced term to the objective function, we simultaneously learn a transferable multi-task reward function and policy. In doing so, the multi-task reward function helps to learn a policy that is robust to environmental change. We validate the advantages of our approach on multi-task learning and multi-task transfer learning, and demonstrate that our proposed method is robust to both randomness and changing task dynamics. Finally, we show that our method achieves significantly better performance and data efficiency than existing imitation learning methods on various benchmarks.
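For concreteness, a minimal sketch of the central quantity, in illustrative notation of our own (state $s$, action $a$, next state $s'$, sub-task $c$, exploration distribution $\omega$, variational inverse model $q_\phi$; these symbols are assumptions, not necessarily the paper's). Following standard empowerment formulations, situational empowerment is the maximal mutual information between an action and the resulting state, conditioned on both the current state and the sub-task:

% Illustrative sketch of the standard variational empowerment bound,
% conditioned additionally on the sub-task c; notation is assumed.
\begin{align}
  \Phi(s, c)
    &= \max_{\omega}\, I(s'; a \mid s, c)
     = \max_{\omega}\,
       \mathbb{E}_{\omega(a \mid s,\, c)\, p(s' \mid s,\, a)}
       \!\left[ \log \frac{p(a \mid s, s', c)}{\omega(a \mid s, c)} \right] \\
    % Replacing the intractable inverse model p(a | s, s', c) with a
    % learned variational posterior q_\phi yields a tractable lower
    % bound, by non-negativity of the KL divergence:
    &\ge \max_{\omega}\,
       \mathbb{E}_{\omega(a \mid s,\, c)\, p(s' \mid s,\, a)}
       \!\left[ \log q_\phi(a \mid s, s', c) - \log \omega(a \mid s, c) \right].
\end{align}

Jointly optimizing this lower bound over $q_\phi$ and $\omega$ is, presumably, what makes the induced empowerment term tractable inside the adversarial objective.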