Learning on the Job: Self-Rewarding Offline-to-Online Finetuning for Industrial Insertion of Novel Connectors from Vision

Learning-based methods in robotics hold the promise of generalization, but what can be done if a learned policy does not generalize to a new situation? In principle, if an agent can at least evaluate its own success (i.e., with a reward classifier that generalizes well even when the policy does not), it could actively practice the task and finetune the policy in this situation. We study this problem in the setting of industrial insertion tasks, such as inserting connectors into sockets and setting screws. Existing algorithms rely on precise localization of the connector or socket and on carefully managed physical setups, such as assembly lines, to succeed at the task. But in unstructured environments such as homes, or even some industrial settings, robots cannot rely on precise localization and may be tasked with previously unseen connectors. Offline reinforcement learning on a variety of connector insertion tasks is a potential solution, but what if the robot is tasked with inserting a previously unseen connector? In such a scenario, we still need methods that can robustly solve the task through online practice. One of the main observations we make in this work is that, with a suitable representation learning and domain generalization approach, it can be significantly easier for the reward function to generalize to a new but structurally similar task (e.g., inserting a new type of connector) than for the policy. This means that a learned reward function can be used to finetune the robot's policy in situations where the policy fails to generalize zero-shot but the reward function generalizes successfully. We show that such an approach can be instantiated in the real world, pretrained on 50 different connectors, and successfully finetuned to new connectors via the learned reward function. Videos can be viewed at https://sites.google.com/view/learningonthejob
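
The self-rewarding idea described above can be illustrated with a deliberately tiny sketch. It is not the authors' method: the 1-D "socket", the scalar policy parameter `theta`, the depth-based `reward_classifier`, and the update rule (keep the attempts the classifier labels as successes and regress the policy onto them) are all hypothetical stand-ins chosen for brevity, whereas the work itself uses vision and reinforcement learning. The sketch only shows the loop structure: a policy that fails zero-shot on a new connector practices online, and a success classifier that still works on the new connector supplies the only training signal.

```python
import random
from statistics import mean

TOL = 0.05  # hypothetical lateral tolerance of the toy socket


def attempt_insertion(action, socket_pos):
    """Toy environment: returns an observation, the insertion depth in [0, 1]."""
    misalignment = abs(action - socket_pos)
    return max(0.0, 1.0 - misalignment / TOL)


def reward_classifier(observation):
    """Stand-in for a learned success classifier (assumed to generalize).

    It looks only at the observation, never at the socket position,
    so it applies unchanged to a connector the policy has not seen.
    """
    return 1 if observation > 0.0 else 0


def finetune(theta, socket_pos, rounds=20, episodes=50, noise=0.2, seed=0):
    """Online practice: explore around theta, self-label, refit on successes."""
    rng = random.Random(seed)
    for _ in range(rounds):
        # Exploratory attempts around the current policy output.
        actions = [theta + rng.gauss(0.0, noise) for _ in range(episodes)]
        # The agent labels its own attempts with the reward classifier.
        successes = [
            a for a in actions
            if reward_classifier(attempt_insertion(a, socket_pos))
        ]
        # Refit the policy on self-labeled successes; skip rounds with none.
        if successes:
            theta = mean(successes)
    return theta


if __name__ == "__main__":
    pretrained_theta = 0.0  # hypothetical policy fit to earlier connectors
    new_socket = 0.3        # hypothetical unseen connector position
    before = reward_classifier(attempt_insertion(pretrained_theta, new_socket))
    tuned_theta = finetune(pretrained_theta, new_socket)
    after = reward_classifier(attempt_insertion(tuned_theta, new_socket))
    print("success before finetuning:", before)
    print("success after finetuning:", after)
```

The design choice worth noting is that `finetune` never reads a ground-truth success label: every update is driven by `reward_classifier`, so the quality of finetuning in this sketch depends entirely on the classifier being correct on the new connector.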
