The core problem in zero-shot open-vocabulary detection is how to align visual and text features so that the detector performs well on unseen classes. Previous approaches train the feature pyramid and detection head from scratch, which breaks the vision-text feature alignment established during pretraining and struggles to prevent the language model from forgetting unseen classes. We propose three methods to alleviate these issues. Firstly, a simple scheme augments the text embeddings, which prevents overfitting to the small number of classes seen during training while simultaneously saving memory and computation. Secondly, the feature pyramid network and the detection head are modified to include trainable gated shortcuts, which encourage vision-text feature alignment and guarantee it at the start of detection training. Finally, a self-training approach leverages a larger corpus of image-text pairs, thus improving detection performance on classes with no human-annotated bounding boxes. We evaluate our three methods on the zero-shot version of the LVIS benchmark, where each shows clear and significant benefits. Our final network achieves a new state-of-the-art on the mAP-all metric, demonstrates competitive performance on mAP-rare, and shows superior transfer to COCO and Objects365.
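The gated-shortcut idea can be made concrete with a minimal sketch, assuming a PyTorch-style implementation; the class name, the scalar gate, and the tanh squashing below are our illustrative choices, not details confirmed by the abstract. The key property is that the gate is initialized to zero, so at the start of detection training the block reduces to the frozen pretrained features and the vision-text alignment from pretraining holds exactly.

```python
import torch
import torch.nn as nn

class GatedShortcut(nn.Module):
    """Hypothetical sketch of a trainable gated shortcut.

    A learnable scalar gate scales the output of a newly trained branch
    before it is added to the frozen pretrained features. Because the gate
    starts at zero, the block is initially an identity over the pretrained
    features, guaranteeing vision-text alignment at the start of training.
    """

    def __init__(self, new_branch: nn.Module):
        super().__init__()
        self.new_branch = new_branch               # e.g. a new FPN or head layer
        self.gate = nn.Parameter(torch.zeros(1))   # gate starts closed

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # tanh bounds the gate in (-1, 1); tanh(0) = 0 at initialization,
        # so the output equals the (frozen) input features.
        return x + torch.tanh(self.gate) * self.new_branch(x)
```

Wrapping each newly added pyramid or head layer in such a block lets the gates open gradually during training, so the network can depart from the pretrained features only as far as the detection loss warrants.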