The adoption of pre-trained visual representations (PVRs), leveraging features from large-scale vision models, has become a popular paradigm for training visuomotor policies. However, these powerful representations can encode a broad range of task-irrelevant scene information, making the resulting policies vulnerable to out-of-domain visual changes and distractors. In this work, we investigate feature pooling in visuomotor policies as a solution to this lack of robustness in perturbed scenes. We achieve this via Attentive Feature Aggregation (AFA), a lightweight, trainable pooling mechanism that learns to attend to task-relevant visual cues while ignoring even semantically rich scene distractors. Through extensive experiments in both simulation and the real world, we demonstrate that policies trained with AFA significantly outperform standard pooling approaches in the presence of visual perturbations, without requiring expensive dataset augmentation or fine-tuning of the PVR. Our findings show that ignoring extraneous visual information is a crucial step towards deploying robust and generalisable visuomotor policies. Project Page: tsagkas.github.io/afa
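To make the idea concrete, below is a minimal sketch of what such a trainable attentive pooling head could look like, assuming a cross-attention design in which a single learnable query summarises the frozen PVR's patch tokens before they reach the policy head. The module name, dimensions, and architecture here are illustrative assumptions, not the paper's exact implementation.

```python
# Hypothetical sketch of attentive pooling over frozen PVR patch features.
# A learnable query cross-attends over the patch tokens, so the pooled
# vector can down-weight distractor patches instead of averaging them in.
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    def __init__(self, feat_dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Single learnable query that summarises the patch tokens.
        self.query = nn.Parameter(torch.randn(1, 1, feat_dim))
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(feat_dim)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, feat_dim) from a frozen backbone.
        q = self.query.expand(patch_tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, patch_tokens, patch_tokens)
        return self.norm(pooled.squeeze(1))  # (batch, feat_dim)

# Usage: replace mean/CLS pooling of frozen ViT features with the
# attentive head; only the head (and the policy) receive gradients.
tokens = torch.randn(2, 196, 768)  # e.g. frozen ViT patch features
z = AttentivePooling(768)(tokens)  # (2, 768) pooled representation
```

Because only the pooling head is trained, this keeps the PVR frozen, which matches the paper's claim of avoiding backbone fine-tuning while still letting the policy learn what to attend to.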