Object detection algorithms enable many interesting applications that can be deployed on different devices, such as smartphones and wearable devices. In the context of a cultural site, deploying these algorithms on a wearable device, such as a pair of smart glasses, makes it possible to use augmented reality (AR) to show extra information about the artworks and enrich the visitors' experience during their tour. However, object detection algorithms need to be trained on many well-annotated examples to achieve reasonable results. This is a major limitation, since the annotation process requires human supervision, which makes it expensive in terms of time and cost. A possible way to reduce these costs consists in exploiting tools to automatically generate synthetic labeled images from a 3D model of the site. However, models trained on synthetic data do not generalize well to the real images acquired in the target scenario in which they are supposed to be used. Furthermore, object detectors should be able to work with different wearable or mobile devices, which makes generalization even harder. In this paper, we present a new dataset collected in a cultural site to study the problem of domain adaptation for object detection in the presence of multiple unlabeled target domains, corresponding to different cameras, and a labeled source domain, obtained from synthetic images generated for training purposes. We also present a new domain adaptation method which outperforms current state-of-the-art approaches by combining the benefits of aligning the domains at the feature and pixel levels with a self-training process. We release the dataset at https://iplab.dmi.unict.it/OBJ-MDA/ and the code of the proposed architecture at https://github.com/fpv-iplab/STMDA-RetinaNet.