Task specification is at the core of programming autonomous robots. A low-effort modality for task specification is critical for the engagement of non-expert end-users and the ultimate adoption of personalized robot agents. A widely studied approach to task specification is through goals, using either compact state vectors or goal images from the same robot scene. The former is hard for non-experts to interpret and necessitates detailed state estimation and scene understanding. The latter requires generating the desired goal image, which often requires a human to complete the task first, defeating the purpose of having autonomous robots. In this work, we explore alternate and more general forms of goal specification that we expect to be easier for humans to specify and use, such as images obtained from the internet, hand sketches that provide a visual description of the desired task, or simple language descriptions. As a preliminary step in this direction, we investigate the capabilities of large-scale pre-trained models (foundation models) for zero-shot goal specification, and find promising results on a collection of simulated robot manipulation tasks and real-world datasets.
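To make the idea of zero-shot goal specification with a foundation model concrete, the sketch below shows one plausible realization (an illustrative assumption, not the exact method evaluated in this work): a pre-trained CLIP model scores how well the robot's current camera image matches a goal given either as a free-form language description or as an out-of-scene goal image (e.g., an internet photo or a hand sketch). The model checkpoint name and the use of cosine similarity as the goal-achievement score are assumptions for illustration.

```python
# Minimal sketch: zero-shot goal scoring with a pre-trained CLIP model.
# Assumptions (not from the paper): checkpoint "openai/clip-vit-base-patch32",
# cosine similarity as the goal-achievement score.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def goal_score(observation: Image.Image, goal) -> float:
    """Cosine similarity between the current observation and the goal spec.

    `goal` may be a language description (str) or a goal image (PIL.Image),
    e.g. a sketch or an internet photo depicting the desired outcome.
    """
    with torch.no_grad():
        # Embed the robot's current camera image.
        obs_emb = model.get_image_features(
            **processor(images=observation, return_tensors="pt"))
        # Embed the goal, using the text or image tower as appropriate.
        if isinstance(goal, str):
            goal_emb = model.get_text_features(
                **processor(text=[goal], return_tensors="pt"))
        else:
            goal_emb = model.get_image_features(
                **processor(images=goal, return_tensors="pt"))
    # Normalize and take the dot product (cosine similarity).
    obs_emb = obs_emb / obs_emb.norm(dim=-1, keepdim=True)
    goal_emb = goal_emb / goal_emb.norm(dim=-1, keepdim=True)
    return float((obs_emb * goal_emb).sum())

# Example usage: score progress toward a language-specified goal.
# score = goal_score(Image.open("current_view.png"), "the drawer is open")
```

A score of this form could serve either as a success detector (thresholded) or as a shaped reward for a goal-conditioned policy; which of these the paper adopts is not stated in the abstract.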