Detecting objects and estimating their pose remains as one of the major challenges of the computer vision research community. There exists a compromise between localizing the objects and estimating their viewpoints. The detector ideally needs to be view-invariant, while the pose estimation process should be able to generalize towards the category-level. This work is an exploration of using deep learning models for solving both problems simultaneously. For doing so, we propose three novel deep learning architectures, which are able to perform a joint detection and pose estimation, where we gradually decouple the two tasks. We also investigate whether the pose estimation problem should be solved as a classification or regression problem, being this still an open question in the computer vision community. We detail a comparative analysis of all our solutions and the methods that currently define the state of the art for this problem. We use PASCAL3D+ and ObjectNet3D datasets to present the thorough experimental evaluation and main results. With the proposed models we achieve the state-of-the-art performance in both datasets.
翻译:检测对象和估计其外形仍然是计算机视觉研究界的主要挑战之一。 在定位对象和估计其观点之间存在着一种妥协。 探测器最好必须是视觉变化型的, 而表面估计过程应该能够概括到分类层面。 这项工作是探索如何同时使用深层学习模型来解决这两个问题。 为此,我们建议了三个新的深层学习结构,这些结构能够进行联合探测并作出估计,让我们逐渐分解这两个任务。 我们还调查构成的估计问题是否应该作为分类或回归问题加以解决,在计算机视觉界仍是一个未决问题。 我们详细分析了我们所有解决方案的比较分析,以及目前确定这一问题的艺术状态的方法。 我们使用PASAL3D+和ObectNet3D数据集来介绍彻底的实验评估和主要结果。 我们利用拟议的模型在两个数据集中都取得了最新业绩。