Given a video captured from a first-person perspective and recorded in a familiar environment, can we recognize what the person is doing and identify where the action occurs in 3D space? We address this challenging problem of jointly recognizing and localizing the actions of a mobile user on a known 3D map from egocentric videos. To this end, we propose a novel deep probabilistic model. Our model takes as input a Hierarchical Volumetric Representation (HVR) of the environment and an egocentric video, infers the 3D action location as a latent variable, and recognizes the action based on the video and contextual cues surrounding its potential locations. To evaluate our model, we conduct extensive experiments on a newly collected egocentric video dataset, in which both naturalistic human actions and photo-realistic 3D environment reconstructions are captured. Our method demonstrates strong results on both action recognition and 3D action localization across seen and unseen environments. We believe our work points to an exciting research direction at the intersection of egocentric vision and 3D scene understanding.
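The latent-variable formulation can be sketched as follows; the notation here is our own assumption for illustration, not taken from the paper. Let $V$ denote the egocentric video, $E$ the HVR of the environment, $y$ the action label, and $\ell$ the latent 3D action location. Recognition then marginalizes over candidate locations,
$$p(y \mid V, E) \;=\; \sum_{\ell} p(y \mid V, E, \ell)\, p(\ell \mid V, E),$$
so the action posterior is shaped both by the video evidence and by the contextual cues the map provides around each potential location $\ell$.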