Meta-reinforcement learning typically requires orders of magnitude more samples than single-task reinforcement learning methods. This is because meta-training must handle a more diverse distribution of tasks and train extra components such as context encoders. To address this, we propose a novel self-supervised learning task, which we name Trajectory Contrastive Learning (TCL), to improve meta-training. TCL adopts contrastive learning and trains a context encoder to predict whether two transition windows are sampled from the same trajectory. TCL leverages the natural hierarchical structure of context-based meta-RL and makes minimal assumptions, allowing it to be generally applicable to context-based meta-RL algorithms. It accelerates the training of context encoders and improves meta-training overall. Experiments show that TCL performs better than or comparably to a strong meta-RL baseline in most environments on both the meta-RL MuJoCo benchmark (5 of 6 environments) and the Meta-World benchmark (44 of 50 tasks).
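To make the objective concrete, below is a minimal sketch, assuming an InfoNCE-style formulation, of a trajectory contrastive loss in PyTorch: two transition windows cut from the same trajectory form a positive pair, while windows from other trajectories in the batch serve as negatives. The names `encoder`, `tcl_loss`, `window_len`, and the batch layout are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def tcl_loss(encoder, trajectories, window_len=8, temperature=0.1):
    """Illustrative trajectory contrastive loss (assumed InfoNCE form).

    encoder: maps a flattened transition window (B, window_len * transition_dim)
             to an embedding (B, d).
    trajectories: list of (T, transition_dim) tensors, one per trajectory.
    Windows from the same trajectory are positives; windows from other
    trajectories in the batch are negatives.
    """
    anchors, positives = [], []
    for traj in trajectories:
        T = traj.shape[0]
        # Sample two window start indices from the same trajectory.
        i, j = torch.randint(0, T - window_len + 1, (2,)).tolist()
        anchors.append(traj[i:i + window_len].reshape(-1))
        positives.append(traj[j:j + window_len].reshape(-1))

    z_a = F.normalize(encoder(torch.stack(anchors)), dim=-1)    # (B, d)
    z_p = F.normalize(encoder(torch.stack(positives)), dim=-1)  # (B, d)

    # Similarity matrix: entry (m, n) compares anchor m with positive n.
    logits = z_a @ z_p.t() / temperature                         # (B, B)
    labels = torch.arange(len(trajectories))                     # diagonal = same trajectory
    return F.cross_entropy(logits, labels)
```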