Learning robotic tasks in the real world is still highly challenging, and effective practical solutions remain to be found. Imitation learning and reinforcement learning are the traditional approaches in this area, but both have limitations when applied to real robots. Combining reinforcement learning with pre-collected demonstrations is a promising approach for learning control policies that solve robotic tasks. In this paper, we propose an algorithm that introduces novel techniques for leveraging offline expert data during both offline and online training, yielding faster convergence and improved performance. The proposed algorithm (AWET) weights the critic losses with a novel agent advantage weight to improve over the expert data. In addition, AWET makes use of an automatic early termination technique to stop and discard policy rollouts that are not similar to expert trajectories, in order to prevent the policy from drifting far from the expert data. In an ablation study, AWET showed improved and promising performance when compared to state-of-the-art baselines on four standard robotic tasks.
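The two ideas named in the abstract, an advantage-weighted critic loss and early termination of rollouts that drift from the expert data, can be illustrated with a minimal sketch. This is not the authors' implementation: the base actor-critic, the exponential weighting with clipping, and the nearest-neighbour distance test are all illustrative assumptions, and every name below (`advantage_weighted_critic_loss`, `should_terminate_early`, `beta`, `threshold`) is hypothetical.

```python
import numpy as np
import torch
import torch.nn.functional as F

def advantage_weighted_critic_loss(critic, value_fn, obs, act, td_target, beta=1.0):
    """Weight the per-sample critic (TD) loss by an exponentiated agent
    advantage, so transitions the current agent rates above its baseline
    contribute more than raw expert data (assumed formulation)."""
    q = critic(obs, act)
    advantage = (q - value_fn(obs)).detach()               # agent advantage A(s, a)
    weight = torch.exp(beta * advantage).clamp(max=10.0)   # clipping is an assumption
    per_sample_td_loss = F.mse_loss(q, td_target, reduction="none")
    return (weight * per_sample_td_loss).mean()

def should_terminate_early(state, expert_states, threshold=0.5):
    """Stop and discard a rollout once the visited state is farther than
    `threshold` from every stored expert state (nearest-neighbour check)."""
    dists = np.linalg.norm(expert_states - state, axis=-1)
    return dists.min() > threshold
```

In this sketch, rollouts flagged by `should_terminate_early` would simply be dropped rather than added to the replay buffer, which matches the abstract's description of discarding rollouts that are not similar to expert trajectories.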