Narrated instructional videos often show and describe manipulations of similar objects, e.g., repairing a particular model of a car or laptop. In this work we aim to reconstruct such objects and to localize the associated narrations in 3D. In contrast to the standard scenario of instance-level 3D reconstruction, where identical objects or scenes are present in all views, objects in different instructional videos may vary widely in appearance because of differing conditions and versions of the same product. Narrations may also vary widely in their natural language expressions. We address these challenges with three contributions. First, we propose an approach for correspondence estimation that combines learnt local features and dense flow. Second, we design a two-step divide-and-conquer reconstruction approach in which the initial 3D reconstructions of individual videos are combined into a 3D alignment graph. Finally, we propose an unsupervised approach to ground natural language in the obtained 3D reconstructions. We demonstrate the effectiveness of our approach in the domain of car maintenance. Given raw instructional videos and no manual supervision, our method successfully reconstructs engines of different car models and associates textual descriptions with the corresponding objects in 3D.