In this paper, we study the problem of egocentric scene understanding, i.e., predicting depths and surface normals from an egocentric image. Egocentric scene understanding poses unprecedented challenges: (1) due to large head movements, the images are taken from non-canonical viewpoints (i.e., tilted images) where existing models of geometry prediction do not apply; (2) dynamic foreground objects, including hands, constitute a large proportion of visual scenes. These challenges limit the performance of existing models learned from large indoor datasets, such as ScanNet and NYUv2, which comprise predominantly upright images of static scenes. We present a multimodal spatial rectifier that stabilizes egocentric images with respect to a set of reference directions, which allows learning a coherent visual representation. Unlike a unimodal spatial rectifier, which often produces excessive perspective warps for egocentric images, the multimodal spatial rectifier learns multiple directions that minimize the impact of the perspective warp. To learn visual representations of the dynamic foreground objects, we present a new dataset called EDINA (Egocentric Depth on everyday INdoor Activities), which comprises more than 500K synchronized RGBD frames and gravity directions. Equipped with the multimodal spatial rectifier and the EDINA dataset, our proposed method for single-view depth and surface normal estimation significantly outperforms the baselines not only on our EDINA dataset, but also on other popular egocentric datasets, such as First Person Hand Action (FPHA) and EPIC-KITCHENS.
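The rectification step described above can be pictured concretely: warp each frame by the rotation-induced homography that sends the estimated gravity direction to the nearest of a small set of reference directions. Below is a minimal sketch of that idea, assuming a pinhole camera with known intrinsics K and using OpenCV for the warp; the function names, the nearest-direction selection rule, and the reference set are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np
import cv2  # OpenCV, used here only for the perspective warp


def rotation_between(a, b):
    """Rodrigues rotation taking unit vector a onto unit vector b.
    (The antipodal case a ~ -b is degenerate and not handled in this sketch.)"""
    a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)
    v = np.cross(a, b)
    c = float(np.dot(a, b))
    if np.isclose(c, 1.0):  # already aligned
        return np.eye(3)
    Vx = np.array([[0.0, -v[2], v[1]],
                   [v[2], 0.0, -v[0]],
                   [-v[1], v[0], 0.0]])  # skew-symmetric cross-product matrix
    return np.eye(3) + Vx + Vx @ Vx * (1.0 / (1.0 + c))


def rectify(image, g_cam, ref_dirs, K):
    """Warp `image` so the estimated gravity direction g_cam (in camera
    coordinates) aligns with the closest reference direction, keeping the
    rotation, and hence the perspective distortion, small."""
    g = g_cam / np.linalg.norm(g_cam)
    # pick the reference direction nearest to the observed gravity
    r = max(ref_dirs, key=lambda d: float(np.dot(g, d / np.linalg.norm(d))))
    R = rotation_between(g, r)
    H = K @ R @ np.linalg.inv(K)  # homography induced by a pure camera rotation
    h, w = image.shape[:2]
    return cv2.warpPerspective(image, H, (w, h))
```

With a single reference direction (e.g., upright), this reduces to a unimodal rectifier: heavily tilted egocentric frames then require large rotations and hence strong perspective warps, which is what learning multiple reference directions is meant to avoid.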