In this work, we study approaches to self-supervised pretraining of object detection models. We first design a general framework for learning a spatially consistent dense representation from an image: we randomly sample boxes, project them into each augmented view, and maximize the similarity between corresponding box features. We study existing design choices from the literature, such as box generation, feature extraction strategies, and the use of multiple views, the last inspired by its success in instance-level image representation learning. Our results suggest that the method is robust to different hyperparameter choices, and that multiple views are not as effective as they are for instance-level image representation learning. We also design two auxiliary tasks that predict boxes in one view from their features in the other view: (1) identifying the corresponding box among the sampled set with a contrastive loss, and (2) regressing box coordinates with a transformer, which could potentially benefit downstream object detection tasks. We find that neither task improves object detection performance when the pretrained model is finetuned on labeled data.
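The core of the framework described above, projecting a sampled box into each augmented view and pulling corresponding box features together, can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function names `project_box` and `box_contrastive_loss` are hypothetical, the views are assumed to be axis-aligned crops of the original image, and box features are taken as given vectors (in practice they would come from an RoI-pooled backbone).

```python
import numpy as np

def project_box(box, crop):
    """Project a box (x1, y1, x2, y2) from original-image coordinates into
    the frame of an augmented view defined by an axis-aligned crop
    (cx1, cy1, cx2, cy2), clipping to the crop extent.
    (Rescaling after the crop is omitted for simplicity.)"""
    x1, y1, x2, y2 = box
    cx1, cy1, cx2, cy2 = crop
    w, h = cx2 - cx1, cy2 - cy1
    return (np.clip(x1 - cx1, 0, w), np.clip(y1 - cy1, 0, h),
            np.clip(x2 - cx1, 0, w), np.clip(y2 - cy1, 0, h))

def box_contrastive_loss(feats_a, feats_b, temperature=0.1):
    """InfoNCE-style loss over box features: box i in view A should be most
    similar to box i in view B, with the other sampled boxes as negatives."""
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature               # (N, N) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))          # positives on the diagonal
```

The same loss doubles as a sketch of auxiliary task (1): treating the row-wise softmax over `logits` as a prediction of which box in the other view's sampled set corresponds to each box.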