Phrase detection requires a method to decide whether a phrase is relevant to an image and, if so, to localize it. A key challenge in training more discriminative phrase detection models is sampling hard negatives, because only a few of the nearly infinite phrase variations that may apply to an image are annotated. To address this problem, we introduce PFP-Net, a phrase detector that differentiates between phrases through two novel methods. First, we group phrases of related objects into coarse groups of visually coherent concepts (e.g., animals vs. automobiles), and then train PFP-Net to discriminate between them according to their concept membership. Second, for phrases containing fine-grained, mutually exclusive tokens (e.g., colors), we force the model to select only one applicable phrase for each region. We evaluate our approach on the Flickr30K Entities and RefCOCO+ datasets, where we improve mAP over the state of the art by 1-1.5 points over all phrases on this challenging task. When considering only the phrases affected by our fine-grained reasoning module, we improve by 1-4 points on both datasets.
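To make the second idea concrete, the following is a minimal sketch of choosing a single phrase per region from a set of mutually exclusive candidates. It is not PFP-Net's implementation: the softmax normalization, the function names `softmax` and `pick_exclusive_phrase`, and the example scores are all assumptions made for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pick_exclusive_phrase(region_scores):
    """Pick exactly one phrase for a region from a mutually exclusive set.

    region_scores: dict mapping phrase -> raw (unnormalized) score.
    Returns (best_phrase, probabilities), where the probabilities sum to 1,
    so raising the score of one phrase lowers the probability of the others.
    """
    phrases = list(region_scores)
    probs = softmax([region_scores[p] for p in phrases])
    prob_by_phrase = dict(zip(phrases, probs))
    best = max(prob_by_phrase, key=prob_by_phrase.get)
    return best, prob_by_phrase


# Hypothetical scores for one region; the phrases differ only in a color token.
scores = {"red shirt": 2.1, "blue shirt": 0.3, "green shirt": -1.0}
best, probs = pick_exclusive_phrase(scores)
print(best, probs)
```

Normalizing across the candidates, rather than scoring each phrase independently, is one simple way to express that at most one color can apply to a given region.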