We pose a new question: can agents learn to combine actions from previous tasks to complete new tasks, as humans do? Unlike imitation learning, this setting provides no expert data, only data collected through environmental exploration. Compared with offline reinforcement learning, the data distribution shift is more severe: the action sequence that solves a new task may be a combination of trajectory segments from multiple training tasks, so neither the test task nor its solution strategy appears directly in the training data, which makes the problem harder. We propose a Memory-related Multi-task Method (M3) to address this problem. The method consists of three stages. First, task-agnostic exploration is carried out to collect data; unlike previous methods, we organize the exploration data into a knowledge graph. From this data, we design a model that extracts action effect features and stores them in memory, and we train an action predictive model. Second, for a new task, the action effect features stored in memory are used to generate candidate actions through a feature decomposition-based approach. Finally, a multi-scale candidate action pool and the action predictive model are fused to generate a strategy that completes the task. Experimental results show that the proposed method significantly outperforms the baseline.
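To make the problem setting concrete, the following is a minimal toy sketch, not the M3 method itself. It assumes discrete, hashable states and deterministic transitions, and it substitutes plain breadth-first search for the learned components (action effect features, memory, and the action predictive model). The names `build_graph`, `plan`, and the example trajectories are hypothetical. It only illustrates how exploration data merged into a graph can yield an action sequence that no single training trajectory contains.

```python
from collections import defaultdict, deque


def build_graph(trajectories):
    """Merge (state, action, next_state) transitions into one directed graph.

    Assumption: states are hashable and transitions are deterministic.
    """
    graph = defaultdict(dict)
    for trajectory in trajectories:
        for state, action, next_state in trajectory:
            graph[state][action] = next_state
    return graph


def plan(graph, start, goal):
    """Breadth-first search for a shortest action sequence from start to goal.

    This is a stand-in for the learned planner, not the paper's approach.
    Returns None when the goal is unreachable from the start state.
    """
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        state, actions = queue.popleft()
        if state == goal:
            return actions
        for action, next_state in graph.get(state, {}).items():
            if next_state not in visited:
                visited.add(next_state)
                queue.append((next_state, actions + [action]))
    return None


# Two hypothetical exploration trajectories that share the state "B".
trajectory_a = [("A", "right", "B"), ("B", "right", "C")]
trajectory_b = [("B", "down", "E"), ("E", "down", "F")]

graph = build_graph([trajectory_a, trajectory_b])

# New task: reach "F" from "A". Neither trajectory does this on its own.
new_task_plan = plan(graph, "A", "F")
print(new_task_plan)  # ['right', 'down', 'down']
```

The returned plan takes its first step from one trajectory and the remaining steps from the other, which is the kind of segment recombination the abstract describes. A learned method is needed once states are high-dimensional or the stored transitions do not connect exactly.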