Robots will experience non-stationary environment dynamics throughout their lifetime: the robot dynamics can change due to wear and tear, or the surroundings may change over time. Eventually, the robot should perform well in all of the environment variations it has encountered. At the same time, it should still be able to learn quickly in a new environment. We identify two challenges in Reinforcement Learning (RL) under such a lifelong learning setting with off-policy data. First, existing off-policy algorithms struggle with the trade-off between being conservative enough to maintain good performance in the old environment and learning efficiently in the new environment, even when all the data is kept in the replay buffer. We propose the Offline Distillation Pipeline to break this trade-off by separating the training procedure into an online interaction phase and an offline distillation phase. Second, we find that training on imbalanced off-policy data from multiple environments across the lifetime causes a significant performance drop. We identify that this drop is caused by the combination of imbalanced quality and size among the datasets, which exacerbates the extrapolation error of the Q-function. During the distillation phase, we apply a simple fix by keeping the policy close to the behavior policy that generated the data. In the experiments, we demonstrate these two challenges and the proposed solutions with a simulated bipedal-robot walking task across various environment changes. We show that the Offline Distillation Pipeline achieves better performance across all the encountered environments without affecting data collection. We also provide a comprehensive empirical study to support our hypothesis on the data imbalance issue.
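To make the distillation-phase fix concrete, one common way to "keep the policy close to the behavior policy" is a behavior-regularized policy update. The following is a minimal PyTorch sketch under that assumption; the `Policy` class, the `q_fn` critic, the squared-error regularizer, and the `bc_weight` coefficient are illustrative placeholders, not the exact formulation used in the paper.

```python
import torch
import torch.nn as nn

# Hypothetical deterministic policy network (names are illustrative).
class Policy(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, act_dim), nn.Tanh(),
        )

    def forward(self, obs):
        return self.net(obs)

def distillation_loss(policy, q_fn, obs, behavior_actions, bc_weight=1.0):
    """One possible distillation objective: improve the policy under a
    learned critic while penalizing deviation from the actions stored in
    the replay data (i.e., the behavior policy's actions)."""
    actions = policy(obs)
    q_term = -q_fn(obs, actions).mean()                   # policy-improvement term
    bc_term = ((actions - behavior_actions) ** 2).mean()  # stay close to behavior policy
    return q_term + bc_weight * bc_term
```

Increasing `bc_weight` trades off Q-maximization against staying near the data-generating policy, which is the lever this sketch assumes for mitigating extrapolation error on imbalanced datasets.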