Deep CNN-based object detection systems have achieved remarkable success on several large-scale object detection benchmarks. However, training such detectors requires a large number of labeled bounding boxes, which are more difficult to obtain than image-level annotations. Previous work addresses this issue by transforming image-level classifiers into object detectors. This is done by modeling the differences between the two on categories with both image-level and bounding box annotations, and transferring this information to convert classifiers into detectors for categories without bounding box annotations. We improve on this previous work by incorporating knowledge about object similarities from the visual and semantic domains during the transfer process. The intuition behind our proposed method is that visually and semantically similar categories should share more transferable properties than dissimilar categories; e.g., a better cat detector results from transferring the differences between a dog classifier and a dog detector than from transferring those of the violin class. Experimental results on the challenging ILSVRC2013 detection dataset demonstrate that each of our proposed object-similarity-based knowledge transfer methods outperforms the baseline methods. We found strong evidence that visual similarity and semantic relatedness are complementary for the task, and when combined they notably improve detection, achieving state-of-the-art detection performance in a semi-supervised setting.
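The similarity-weighted transfer described above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the function name, the linear classifier/detector weight representation, and the convex mixing of visual and semantic similarity via `alpha` are all assumptions made for exposition.

```python
import numpy as np

def transfer_detector_weights(w_cls_target, w_cls_src, w_det_src,
                              visual_sim, semantic_sim, alpha=0.5):
    """Estimate detector weights for a target class lacking box annotations.

    Hypothetical sketch: source classes have both a classifier (w_cls_src)
    and a detector (w_det_src); their per-class differences are transferred
    to the target classifier, weighted by similarity to the target.

    w_cls_target: (d,) classifier weights of the target class
    w_cls_src:    (k, d) classifier weights of k source classes
    w_det_src:    (k, d) detector weights of the same k source classes
    visual_sim, semantic_sim: (k,) similarity of each source to the target
    alpha: assumed mixing coefficient between the two similarity domains
    """
    sim = alpha * visual_sim + (1 - alpha) * semantic_sim
    weights = sim / sim.sum()              # normalize into a distribution
    delta = w_det_src - w_cls_src          # per-class classifier -> detector offset
    return w_cls_target + weights @ delta  # similarity-weighted transfer

# Toy usage: a visually/semantically close source (e.g. dog for cat)
# dominates the transferred offset; a distant one (e.g. violin) contributes little.
rng = np.random.default_rng(0)
w_cls_t = rng.normal(size=4)
w_cls_s = rng.normal(size=(3, 4))
w_det_s = w_cls_s + 0.1                    # identical offsets, for a checkable toy case
vis = np.array([0.9, 0.2, 0.1])
sem = np.array([0.8, 0.3, 0.1])
w_det_t = transfer_detector_weights(w_cls_t, w_cls_s, w_det_s, vis, sem)
```

Because the normalized weights sum to one, the transferred offset is a convex combination of the per-source differences, so dissimilar sources can only dilute, never dominate, the estimate.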