Egocentric augmented reality devices such as wearable glasses passively capture visual data as a human wearer tours a home environment. We envision a scenario wherein the human communicates with an AI agent powering such a device by asking questions (e.g., "Where did you last see my keys?"). In order to succeed at this task, the egocentric AI assistant must (1) construct semantically rich and efficient scene memories that encode spatio-temporal information about objects seen during the tour and (2) possess the ability to understand the question and ground its answer in the semantic memory representation. To that end, we introduce (1) a new task, Episodic Memory Question Answering (EMQA), wherein an egocentric AI assistant is provided with a video sequence (the tour) and a question as input and is asked to localize its answer to the question within the tour, (2) a dataset of grounded questions designed to probe the agent's spatio-temporal understanding of the tour, and (3) a model for the task that encodes the scene as an allocentric, top-down semantic feature map and grounds the question into the map to localize the answer. We show that our choice of episodic scene memory outperforms naive, off-the-shelf solutions for the task as well as a host of very competitive baselines, and is robust to noise in depth and pose estimates as well as to camera jitter. The project page can be found at: https://samyak-268.github.io/emqa
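To make the described pipeline concrete, the following is a minimal, hypothetical sketch of the two stages the abstract names: projecting per-frame egocentric features into an allocentric top-down semantic map using depth and pose, then grounding a question embedding into that map to score answer locations. This is not the authors' implementation; the function names, tensor shapes, pinhole intrinsics, map resolution, and the dot-product grounding operator are all illustrative assumptions.

```python
import torch

def project_to_topdown(feats, depth, pose, map_size=64, cell=0.1):
    """Scatter per-pixel features (C, H, W) into an allocentric top-down
    feature map (C, map_size, map_size) using per-pixel metric depth (H, W)
    and a 4x4 camera-to-world pose. Pinhole intrinsics (f = H) are assumed."""
    C, H, W = feats.shape
    # Back-project each pixel to a 3D point in the camera frame.
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    f = float(H)
    x = (u - W / 2) / f * depth
    y = (v - H / 2) / f * depth
    pts = torch.stack([x, y, depth, torch.ones_like(depth)], dim=0).reshape(4, -1)
    world = (pose @ pts)[:3]                       # (3, H*W) world coordinates
    # Discretize ground-plane (x, z) coordinates into top-down map cells.
    gx = (world[0] / cell + map_size / 2).long().clamp(0, map_size - 1)
    gz = (world[2] / cell + map_size / 2).long().clamp(0, map_size - 1)
    idx = gz * map_size + gx                       # flat cell index per pixel
    topdown = torch.zeros(C, map_size * map_size)
    count = torch.zeros(map_size * map_size)
    topdown.index_add_(1, idx, feats.reshape(C, -1))
    count.index_add_(0, idx, torch.ones_like(idx, dtype=torch.float))
    # Mean-pool features falling into the same cell; empty cells stay zero.
    return (topdown / count.clamp(min=1)).view(C, map_size, map_size)

def ground_question(topdown, q_emb):
    """Ground a question embedding (C,) into the map by scoring every cell
    with a dot product, returning a normalized localization heatmap."""
    scores = torch.einsum("chw,c->hw", topdown, q_emb)
    return scores.flatten().softmax(dim=0).view_as(scores)

# Toy usage with random inputs (shapes are illustrative only):
feats = torch.randn(32, 48, 64)        # (C, H, W) per-pixel semantic features
depth = torch.rand(48, 64) * 3 + 0.5   # metric depth in meters
pose = torch.eye(4)                    # camera-to-world transform for a frame
q = torch.randn(32)                    # question embedding
heatmap = ground_question(project_to_topdown(feats, depth, pose), q)
```

In a full system, features from every frame of the tour would be accumulated into the same allocentric map (giving the episodic scene memory), and the heatmap would be decoded into the answer localization; the sketch above shows only a single frame and a single dot-product grounding step.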