Self-supervised Learning of 3D Object Understanding by Data Association and Landmark Estimation for Image Sequence

In this paper, we propose a self-supervised learning method for multi-object pose estimation. 3D object understanding from a 2D image is a challenging task because it must infer an additional dimension from reduced-dimensional information. In particular, estimating the 3D localization or orientation of an object requires precise reasoning, unlike simpler clustering tasks such as object classification. The scale of the training dataset therefore becomes more crucial. However, it is challenging to obtain a large 3D dataset, since 3D annotation is expensive and time-consuming. If the training dataset can be enlarged with image sequences obtained from simple navigation, it becomes possible to overcome the scale limitation of the dataset and to adapt efficiently to a new environment. However, when self-annotation is conducted on a single image by the network itself, the training performance of the network is bounded by the network's own performance. Therefore, we propose a strategy that exploits multiple observations of an object in the image sequence in order to surpass this self-performance: first, the landmarks for the global object map are estimated through network prediction and data association, and the corrected annotation for a single frame is obtained. Then, the network is fine-tuned on data that includes the self-annotated dataset, thereby exceeding the performance boundary of the network itself. The proposed method was evaluated on the KITTI driving scene dataset, and we demonstrate improved pose estimation of multiple objects in 3D space.
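
The multi-observation idea can be pictured with a minimal sketch. It is not the paper's implementation. It assumes a 2D bird's-eye-view setting, known ego poses `(x, y, yaw)` per frame, greedy nearest-neighbour data association with a distance gate, and a plain running mean as the landmark estimate. All function names and the `gate` parameter are hypothetical and chosen only for illustration.

```python
import math

def to_world(pose, p):
    """Map a point from the ego frame to the world frame (2D, assumed known pose)."""
    x, y, yaw = pose
    c, s = math.cos(yaw), math.sin(yaw)
    return (x + c * p[0] - s * p[1], y + s * p[0] + c * p[1])

def to_ego(pose, q):
    """Map a world-frame point back into the ego frame of the given pose."""
    x, y, yaw = pose
    c, s = math.cos(yaw), math.sin(yaw)
    dx, dy = q[0] - x, q[1] - y
    return (c * dx + s * dy, -s * dx + c * dy)

def associate_and_fuse(poses, detections, gate=2.0):
    """Greedy association of per-frame detections to global landmarks.

    poses:      one (x, y, yaw) ego pose per frame
    detections: per frame, a list of predicted object positions in the ego frame
    gate:       hypothetical association threshold in metres
    Returns the landmark means and a map (frame, detection index) -> landmark id.
    """
    sums, counts, assign = [], [], {}
    for f, (pose, dets) in enumerate(zip(poses, detections)):
        for i, d in enumerate(dets):
            w = to_world(pose, d)
            best, best_dist = None, gate
            for k, (acc, n) in enumerate(zip(sums, counts)):
                dist = math.hypot(w[0] - acc[0] / n, w[1] - acc[1] / n)
                if dist < best_dist:
                    best, best_dist = k, dist
            if best is None:
                # No landmark within the gate: start a new one.
                sums.append([w[0], w[1]])
                counts.append(1)
                best = len(sums) - 1
            else:
                # Fuse the observation into the matched landmark's running mean.
                sums[best][0] += w[0]
                sums[best][1] += w[1]
                counts[best] += 1
            assign[(f, i)] = best
    landmarks = [(acc[0] / n, acc[1] / n) for acc, n in zip(sums, counts)]
    return landmarks, assign

def corrected_annotations(poses, landmarks, assign):
    """Re-express each fused landmark in the frame of every detection assigned to it."""
    return {key: to_ego(poses[key[0]], landmarks[k]) for key, k in assign.items()}

# Three frames of straight-line motion; one object seen three times with noise,
# plus a second, distant object seen once.
poses = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
detections = [[(10.3, 2.0), (30.0, -5.0)], [(8.7, 2.0)], [(8.0, 2.0)]]
landmarks, assign = associate_and_fuse(poses, detections)
pseudo_labels = corrected_annotations(poses, landmarks, assign)
```

In this toy setting, the fused landmark averages out the per-frame noise, so the re-projected `pseudo_labels` can be more accurate than any single prediction. That is the intuition behind fine-tuning on self-annotations derived from several observations. The sketch does not prevent two detections in the same frame from matching the same landmark, and it ignores orientation and depth uncertainty, which a practical system would need to handle.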
