EC^2: Emergent Communication for Embodied Control

Embodied control requires agents to leverage multi-modal pre-training to quickly learn how to act in new environments, where video demonstrations contain visual and motion details needed for low-level perception and control, and language instructions support generalization with abstract, symbolic structures. While recent approaches apply contrastive learning to force alignment between the two modalities, we hypothesize that better modeling their complementary differences can lead to more holistic representations for downstream adaptation. To this end, we propose Emergent Communication for Embodied Control (EC^2), a novel scheme to pre-train video-language representations for few-shot embodied control. The key idea is to learn an unsupervised "language" of videos via emergent communication, which bridges the semantics of video details and structures of natural language. We learn embodied representations of video trajectories, emergent language, and natural language using a language model, which is then used to finetune a lightweight policy network for downstream control. Through extensive experiments in Metaworld and Franka Kitchen embodied benchmarks, EC^2 is shown to consistently outperform previous contrastive learning methods for both videos and texts as task inputs. Further ablations confirm the importance of the emergent language, which is beneficial for both video and language learning, and significantly superior to using pre-trained video captions. We also present a quantitative and qualitative analysis of the emergent language and discuss future directions toward better understanding and leveraging emergent communication in embodied tasks.
