While pre-trained visual representations have significantly advanced imitation learning, they are often task-agnostic because they remain frozen during policy learning. In this work, we explore leveraging pre-trained text-to-image diffusion models to obtain task-adaptive visual representations for robotic control, without fine-tuning the diffusion model itself. However, we find that naively applying textual conditions, a strategy that succeeds in other vision domains, yields minimal or even negative gains in control tasks. We attribute this to the domain gap between the diffusion model's training data and robotic control environments, which leads us to argue for conditions that capture the specific, dynamic visual information required for control. To this end, we propose ORCA, which introduces learnable task prompts that adapt to the control environment and visual prompts that capture fine-grained, frame-specific details. By facilitating task-adaptive representations through these newly devised conditions, our approach achieves state-of-the-art performance on various robotic control benchmarks, significantly surpassing prior methods.
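To make the conditioning idea concrete, below is a minimal, hypothetical PyTorch sketch of how learnable task prompts and frame-specific visual prompts could be combined and passed as conditions to a frozen diffusion backbone whose intermediate features serve as the policy's representation. All class names, dimensions, and the backbone interface (`frozen_unet`) are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn


class TaskAdaptiveConditioner(nn.Module):
    """Hypothetical sketch: merges learnable task prompts with per-frame
    visual prompts to form the conditioning sequence for a frozen
    text-to-image diffusion backbone."""

    def __init__(self, num_task_tokens=8, embed_dim=768, visual_dim=512):
        super().__init__()
        # Learnable task prompt tokens, optimized jointly with the policy
        # while the diffusion backbone stays frozen.
        self.task_prompts = nn.Parameter(
            torch.randn(num_task_tokens, embed_dim) * 0.02
        )
        # Projects frame-specific visual features into the conditioning space.
        self.visual_proj = nn.Linear(visual_dim, embed_dim)

    def forward(self, frame_features):
        # frame_features: (B, N, visual_dim) per-frame visual prompt features.
        B = frame_features.shape[0]
        task = self.task_prompts.unsqueeze(0).expand(B, -1, -1)  # (B, T, D)
        visual = self.visual_proj(frame_features)                # (B, N, D)
        # The concatenated prompts replace the usual text conditioning.
        return torch.cat([task, visual], dim=1)


def extract_representation(frozen_unet, conditioner, image_latents,
                           frame_features, timestep):
    """Run the frozen diffusion backbone once with the learned conditions and
    read out an intermediate feature map as the control representation.
    `frozen_unet` is assumed to be a callable taking
    (latents, timestep, encoder_hidden_states) and returning features; its
    parameters are frozen, but gradients still flow back to the prompts."""
    cond = conditioner(frame_features)
    feats = frozen_unet(image_latents, timestep, encoder_hidden_states=cond)
    return feats
```

Note that the backbone's weights would be frozen via `requires_grad_(False)` rather than `torch.no_grad()`, so that gradients can still propagate through the frozen network to update the task prompts and the visual projection alongside the policy.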