Zero-shot object detection is an emerging research topic that aims to recognize and localize previously 'unseen' objects. This setting gives rise to several unique challenges, e.g., a highly imbalanced ratio of positive to negative instances, ambiguity between background and unseen classes, and the need for proper alignment between visual and semantic concepts. Here, we propose an end-to-end deep learning framework underpinned by a novel loss function that puts more emphasis on difficult examples to counter class imbalance. We call our objective the 'Polarity loss' because it explicitly maximizes the gap between positive and negative predictions. Such a margin-maximizing formulation is important because it improves the visual-semantic alignment while resolving the ambiguity between background and unseen classes. Our approach is inspired by the embodiment theories in cognitive science, which hold that human semantic understanding is grounded in past experiences (seen objects), related linguistic concepts (word dictionary) and the perception of the physical world (visual imagery). To this end, we learn to attend to a dictionary of related semantic concepts, which refines the noisy semantic embeddings and helps establish a better synergy between the visual and semantic domains. Our extensive results on the MS-COCO and Pascal VOC datasets show up to a 14× mAP improvement over the state of the art.
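To make the idea of a loss that both emphasizes difficult examples and widens the gap between positive and negative predictions more concrete, the following is a minimal sketch, not the exact formulation used in this work. It assumes a focal-loss-style base term (with hypothetical defaults `alpha`, `gamma`) that is re-weighted by a sigmoid penalty on the difference between each class score and the ground-truth class score (with a hypothetical steepness `beta`). The function name `polarity_style_loss` and the choice to leave background anchors unpenalized are illustrative assumptions.

```python
import math


def sigmoid(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def polarity_style_loss(probs, target, alpha=0.25, gamma=2.0, beta=5.0, eps=1e-7):
    """Illustrative margin-penalized focal loss for a single anchor.

    probs  : list of per-class sigmoid probabilities.
    target : index of the ground-truth class, or None for a background anchor.
    alpha, gamma : focal-loss balancing and focusing parameters (assumed values).
    beta   : steepness of the gap penalty (assumed value).
    """
    p_true = probs[target] if target is not None else None
    total = 0.0
    for i, p in enumerate(probs):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        is_positive = (i == target)
        p_t = p if is_positive else 1.0 - p
        alpha_t = alpha if is_positive else 1.0 - alpha
        # Focal term: down-weights easy examples, keeps hard ones prominent.
        focal = -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
        if p_true is None:
            # Assumption: background anchors fall back to the plain focal term.
            penalty = 1.0
        else:
            # Penalty approaches 1 when a class scores above the true class
            # and approaches 0 when it scores well below it.
            penalty = sigmoid(beta * (p - p_true))
        total += penalty * focal
    return total
```

Under these assumptions, a prediction such as `[0.9, 0.1, 0.1]` for true class 0 incurs a smaller loss than `[0.4, 0.6, 0.1]`, because in the second case a negative class outscores the positive one and is penalized at nearly full weight.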