State representation learning (SRL) in partially observable Markov decision processes has been studied to learn abstract features of data that are useful for robot control tasks. For SRL, acquiring domain-agnostic states is essential for achieving efficient imitation learning. Without such states, imitation learning is hampered by domain-dependent information that is useless for control. However, existing methods fail to remove such disturbances from the states when the data from experts and agents exhibit large domain shifts. To overcome this issue, we propose a domain-adversarial and conditional state space model (DAC-SSM) that enables control systems to obtain domain-agnostic, task- and dynamics-aware states. DAC-SSM jointly optimizes the state inference, observation reconstruction, forward dynamics, and reward models. To remove domain-dependent information from the states, the model is trained with domain discriminators in an adversarial manner, and the reconstruction is conditioned on domain labels. We experimentally evaluated model predictive control performance via imitation learning on continuous control, sparse-reward tasks in simulators, and compared it with that of an existing SRL method. The agents trained with DAC-SSM achieved performance comparable to that of the experts and more than twice that of the baselines. We conclude that domain-agnostic states are essential for imitation learning under large domain shifts and can be obtained with DAC-SSM.
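The abstract names two mechanisms for stripping domain-dependent information from the latent states: a domain discriminator trained adversarially against the state encoder, and an observation decoder conditioned on the domain label. The following is a minimal, self-contained PyTorch sketch of those two mechanisms under a deliberately simplified single-step (non-recurrent, deterministic) latent model; all module names, layer sizes, loss weights, and the use of a gradient reversal layer are illustrative assumptions, not the paper's exact architecture.

```python
# Sketch of domain-adversarial, conditional latent-state training, assuming a
# single-step deterministic latent model (the paper's model is a full SSM).
import torch
import torch.nn as nn
import torch.nn.functional as F


class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; flips the gradient sign on backward, so
    minimizing the discriminator loss pushes the encoder to *remove* domain
    information from the state (a common adversarial-feature-learning trick)."""
    @staticmethod
    def forward(ctx, x):
        return x

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output


class DACSSMSketch(nn.Module):
    def __init__(self, obs_dim=16, act_dim=4, state_dim=8, n_domains=2):
        super().__init__()
        self.n_domains = n_domains
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                     nn.Linear(64, state_dim))
        # The decoder sees the state concatenated with a one-hot domain label,
        # so domain-specific appearance can be explained by the label instead
        # of leaking into the state.
        self.decoder = nn.Sequential(nn.Linear(state_dim + n_domains, 64),
                                     nn.ReLU(), nn.Linear(64, obs_dim))
        self.dynamics = nn.Linear(state_dim + act_dim, state_dim)
        self.reward = nn.Linear(state_dim, 1)
        self.discriminator = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                           nn.Linear(64, n_domains))

    def loss(self, obs, act, next_obs, rew, domain):
        s = self.encoder(obs)
        d_onehot = F.one_hot(domain, self.n_domains).float()
        # Jointly optimized terms: conditional reconstruction, forward
        # dynamics, and reward prediction.
        recon = F.mse_loss(self.decoder(torch.cat([s, d_onehot], -1)), obs)
        dyn = F.mse_loss(self.dynamics(torch.cat([s, act], -1)),
                         self.encoder(next_obs).detach())
        rwd = F.mse_loss(self.reward(s).squeeze(-1), rew)
        # Adversarial term: the discriminator classifies the domain from the
        # state; gradient reversal trains the encoder to defeat it.
        adv = F.cross_entropy(self.discriminator(GradReverse.apply(s)), domain)
        return recon + dyn + rwd + adv


# Toy usage on a random mixed batch of expert (0) and agent (1) transitions.
model = DACSSMSketch()
obs, act = torch.randn(32, 16), torch.randn(32, 4)
next_obs, rew = torch.randn(32, 16), torch.randn(32)
domain = torch.randint(0, 2, (32,))
model.loss(obs, act, next_obs, rew, domain).backward()
```

Gradient reversal is one standard way to realize the adversarial objective in a single backward pass; an equivalent alternative is alternating discriminator and encoder updates with separate optimizers.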