First-person video highlights a camera-wearer's activities in the context of their persistent environment. However, current video understanding approaches reason over visual features from short video clips that are detached from the underlying physical space and capture only what is immediately visible. We present an approach that links egocentric video and the environment by learning representations that are predictive of the camera-wearer's (potentially unseen) local surroundings to facilitate human-centric environment understanding. We train such models using videos from agents in simulated 3D environments where the environment is fully observable, and test them on human-captured real-world videos from unseen environments. On two human-centric video tasks, we show that state-of-the-art video models equipped with our environment-aware features consistently outperform their counterparts with traditional clip features. Moreover, despite being trained exclusively on simulated videos, our approach successfully handles real-world videos from HouseTours and Ego4D. Project page: https://vision.cs.utexas.edu/projects/ego-env/