We study the problem of translating an image-based, step-by-step assembly manual created by human designers into machine-interpretable instructions. We formulate this problem as a sequential prediction task: at each step, our model reads the manual, locates the components to be added to the current shape, and infers their 3D poses. This task poses two challenges: establishing a 2D-3D correspondence between the manual image and the real 3D object, and estimating the 3D poses of unseen 3D objects, since a new component added in a step may itself be an object built in previous steps. To address these two challenges, we present a novel learning-based framework, the Manual-to-Executable-Plan Network (MEPNet), which reconstructs the assembly steps from a sequence of manual images. The key idea is to integrate neural 2D keypoint detection modules and 2D-3D projection algorithms for high-precision prediction and strong generalization to unseen components. MEPNet outperforms existing methods on three newly collected LEGO manual datasets and a Minecraft house dataset.
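To make the sequential formulation concrete, the sketch below illustrates a per-step parsing loop of the kind the abstract describes: for each manual image, a neural keypoint module localizes the newly added components in 2D, and a projection routine lifts them to 3D poses on the partially built shape. All names here (`ComponentPose`, `keypoint_detector`, `project_2d_to_3d`, `current_shape.add`) are hypothetical placeholders under assumed interfaces, not the authors' actual implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ComponentPose:
    """Hypothetical container for one placed component in an assembly step."""
    component_id: int                   # index into the part library, or a sub-assembly
    rotation: Tuple[float, float, float]  # 3D orientation (e.g., Euler angles)
    translation: Tuple[float, float, float]  # 3D position on the current shape

def parse_manual(manual_images, current_shape, keypoint_detector, project_2d_to_3d):
    """Sequentially turn manual pages into a machine-executable assembly plan.

    keypoint_detector: assumed neural module that detects 2D keypoints/masks of
        the components added in the given manual image, conditioned on the shape
        built so far.
    project_2d_to_3d: assumed non-learned routine that lifts 2D detections to 3D
        poses given candidate component geometry (covers both primitive bricks
        and sub-assemblies produced in earlier steps).
    """
    plan: List[List[ComponentPose]] = []
    for image in manual_images:
        # 1. Locate the components added in this step in the 2D manual image.
        detections = keypoint_detector(image, current_shape)
        # 2. Lift each 2D detection to a 3D pose relative to the current shape.
        step_poses = [project_2d_to_3d(d, current_shape) for d in detections]
        # 3. Execute the step: attach the posed components to the current shape,
        #    so later steps can treat the result as a single reusable object.
        current_shape = current_shape.add(step_poses)
        plan.append(step_poses)
    return plan
```

The hybrid design reflected in steps 1-2, learned 2D detection followed by a deterministic 2D-3D projection, is what the abstract credits for precise pose prediction and generalization to components never seen as individual parts.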