We introduce Look and Tell, a multimodal dataset for studying referential communication across egocentric and exocentric perspectives. Using Meta Project Aria smart glasses and stationary cameras, we recorded synchronized gaze, speech, and video as 25 participants instructed a partner to identify ingredients in a kitchen. Combined with 3D scene reconstructions, this setup provides a benchmark for evaluating how different spatial representations (2D vs. 3D; ego vs. exo) affect multimodal grounding. The dataset contains 3.67 hours of recordings, including 2,707 richly annotated referential expressions, and is designed to advance the development of embodied agents that can understand and engage in situated dialogue.