As we move towards large-scale object detection, it is unrealistic to expect annotated training data for all object classes at sufficient scale, and so methods capable of unseen object detection are required. We propose a novel zero-shot method based on training an end-to-end model that fuses semantic attribute prediction with visual features to propose object bounding boxes for seen and unseen classes. While we utilize semantic features during training, our method is agnostic to semantic information for unseen classes at test time. Our method retains the efficiency and effectiveness of YOLO for objects seen during training, while improving its performance for novel and unseen objects. The ability of state-of-the-art detection methods to learn discriminative object features in order to reject background proposals also limits their performance on unseen objects. We posit that, to detect unseen objects, we must incorporate semantic information into the visual domain so that the learned visual features reflect this information and lead to improved recall rates for unseen objects. We test our method on the PASCAL VOC and MS COCO datasets and observe significant improvements in the average precision of unseen classes.
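The core idea of scoring unseen classes can be illustrated with a minimal sketch: a learned projection maps a region's visual features into the semantic embedding space, where both seen and unseen classes can be scored by similarity to their embeddings. This is an illustrative simplification, not the paper's actual architecture; the projection matrix, dimensions, and scoring function below are all assumed for the example.

```python
import numpy as np

# Illustrative sketch only (not the paper's exact model): score region
# proposals against class semantic embeddings, so that unseen classes
# can be recognized at test time without visual training examples.

rng = np.random.default_rng(0)

D_VIS, D_SEM = 256, 50  # assumed visual feature and semantic embedding dims
W = rng.normal(scale=0.1, size=(D_SEM, D_VIS))  # stand-in for a learned projection

def project(visual_feat):
    """Map a region proposal's visual feature into the semantic space."""
    z = W @ visual_feat
    return z / np.linalg.norm(z)

def score_classes(visual_feat, class_embeddings):
    """Cosine similarity between the projected feature and each class
    embedding; rows may include unseen classes never used in training."""
    z = project(visual_feat)
    E = class_embeddings / np.linalg.norm(class_embeddings, axis=1, keepdims=True)
    return E @ z

feat = rng.normal(size=D_VIS)                # feature of one region proposal
seen = rng.normal(size=(20, D_SEM))          # e.g. embeddings of seen classes
unseen = rng.normal(size=(5, D_SEM))         # embeddings of unseen classes
scores = score_classes(feat, np.vstack([seen, unseen]))
print(scores.shape)  # one score per class, seen and unseen alike
```

Because classification happens in the shared semantic space, adding an unseen class at test time only requires appending its embedding; no retraining of the visual backbone is needed.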