Recent work in visual representation learning for robotics demonstrates the viability of learning from large video datasets of humans performing everyday tasks. Leveraging methods such as masked autoencoding and contrastive learning, these representations exhibit strong transfer to policy learning for visuomotor control. But robot learning encompasses a diverse set of problems beyond control, including grasp affordance prediction, language-conditioned imitation learning, and intent scoring for human-robot collaboration, amongst others. First, we demonstrate that existing representations yield inconsistent results across these tasks: masked autoencoding approaches pick up on low-level spatial features at the cost of high-level semantics, while contrastive learning approaches capture the opposite. We then introduce Voltron, a framework for language-driven representation learning from human videos and associated captions. Voltron trades off language-conditioned visual reconstruction to learn low-level visual patterns, and visually-grounded language generation to encode high-level semantics. We also construct a new evaluation suite spanning five distinct robot learning problems – a unified platform for holistically evaluating visual representations for robotics. Through comprehensive, controlled experiments across all five problems, we find that Voltron's language-driven representations outperform the prior state-of-the-art, especially on targeted problems requiring higher-level features.
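To make the stated tradeoff concrete, the sketch below shows one way the dual objective could be wired up: a reconstruction loss on masked frames conditioned on the caption (low-level visual patterns) mixed with a captioning loss from visual features alone (high-level semantics). This is a minimal, hypothetical rendering, not the released Voltron implementation; the module names `encoder`, `pixel_decoder`, `language_decoder` and the mixing weight `alpha` are illustrative assumptions.

```python
# Hypothetical sketch of a dual-objective training step in the spirit of
# Voltron; module names and the weighting `alpha` are assumptions, not the
# paper's actual code.
import torch.nn.functional as F

def dual_objective_loss(encoder, pixel_decoder, language_decoder,
                        frames, masked_frames, caption_ids, alpha=0.5):
    """Mix (i) language-conditioned visual reconstruction and
    (ii) visually-grounded language generation, weighted by `alpha`."""
    # (i) Reconstruct the masked frames, conditioning on the caption:
    # pushes the encoder toward low-level spatial/visual patterns.
    latents = encoder(masked_frames, caption_ids)   # fused vision+language features
    recon = pixel_decoder(latents)                  # predicted pixels/patches
    recon_loss = F.mse_loss(recon, frames)

    # (ii) Generate the caption from visual features alone (no text input):
    # forces the encoder to retain high-level semantics.
    vis_latents = encoder(masked_frames, None)
    logits = language_decoder(vis_latents, caption_ids[:, :-1])  # teacher forcing
    lm_loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),        # (batch * seq, vocab)
        caption_ids[:, 1:].reshape(-1),             # next-token targets
    )

    return alpha * recon_loss + (1 - alpha) * lm_loss
```

Under this reading, `alpha` is the single knob governing the tradeoff the abstract describes: values near 1 emphasize reconstruction (low-level features), values near 0 emphasize generation (high-level semantics).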