With recent advances in video and 3D understanding, novel 4D spatio-temporal challenges that fuse both concepts have emerged. In this direction, the Ego4D Episodic Memory Benchmark proposed a task for Visual Queries with 3D Localization (VQ3D). Given an egocentric video clip and an image crop depicting a query object, the goal is to localize the 3D position of the center of that query object with respect to the camera pose of a query frame. Current methods tackle VQ3D by lifting the 2D localization results of the sister task, Visual Queries with 2D Localization (VQ2D), into a 3D reconstruction. Yet, we point out that the low number of Queries with Poses (QwP) in previous VQ3D methods severely hinders their overall success rate and highlights the need for further effort in 3D modeling to tackle the VQ3D task. In this work, we formalize a pipeline that better entangles 3D multiview geometry with 2D object retrieval from egocentric videos. We estimate more robust camera poses, leading to more successful object queries and substantially improved VQ3D performance. In practice, our method reaches a top-1 overall success rate of 86.36% on the Ego4D Episodic Memory Benchmark VQ3D, a 10x improvement over the previous state-of-the-art. In addition, we provide a complete empirical study highlighting the remaining challenges in VQ3D.
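To make the task definition concrete, the following is a minimal sketch of the geometric step that "lifting" a 2D localization into 3D implies: back-projecting a detected object center to a 3D point and expressing it relative to the query frame's camera pose. It is an illustration under stated assumptions, not the pipeline proposed in this work: it assumes a pinhole camera with intrinsics `K`, a known metric depth at the detected pixel, and camera-to-world poses given as 4x4 matrices. The function names `backproject` and `to_query_frame` are hypothetical.

```python
import numpy as np


def backproject(pixel_uv, depth, K):
    """Lift a pixel (u, v) with z-depth `depth` to a 3D point in the camera frame.

    Assumes a pinhole model with intrinsics matrix K (an assumption of this sketch).
    """
    u, v = pixel_uv
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])  # ray with z = 1
    return depth * ray


def to_query_frame(point_cam, T_world_from_cam, T_world_from_query):
    """Express a point observed in one camera frame in the query frame's coordinates.

    Both poses are assumed to be 4x4 camera-to-world transforms.
    """
    p_h = np.append(point_cam, 1.0)                 # homogeneous coordinates
    p_world = T_world_from_cam @ p_h                # observing camera -> world
    p_query = np.linalg.inv(T_world_from_query) @ p_world  # world -> query camera
    return p_query[:3]


# Hypothetical values, chosen only for illustration.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
T_world_from_cam = np.eye(4)            # observing camera sits at the world origin
T_world_from_query = np.eye(4)
T_world_from_query[:3, 3] = [1.0, 0.0, 0.0]  # query camera shifted 1 m along x

center_cam = backproject((320.0, 240.0), 2.0, K)
center_query = to_query_frame(center_cam, T_world_from_cam, T_world_from_query)
```

The sketch also shows why camera poses matter for this task: without a pose for both the frame where the object is detected and the query frame, the final transform cannot be computed, so the query cannot be answered at all.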