Phrase detection requires methods to identify whether a phrase is relevant to an image and, if so, to localize it. A key challenge in training more discriminative detection models is sampling negatives. Sampling techniques from prior work focus primarily on hard, often noisy, negatives, disregarding the broader distribution of negative samples. Our proposed CFCD-Net addresses this through two novel methods. First, we generate groups of semantically similar words that we call concepts (\eg, \{dog, cat, horse\} and \{car, truck, SUV\}), and then train CFCD-Net to discriminate between a region of interest and its unrelated concepts. Second, for phrases containing fine-grained, mutually exclusive words (\eg, colors), we force the model to select only one applicable phrase for each region using our novel fine-grained module (FGM). We evaluate our approach on Flickr30K Entities and RefCOCO+, where we improve mAP over the state of the art by 1.5-2 points. When considering only the phrases affected by our FGM, we improve by 3-4 points on both datasets.
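The two ideas in the abstract can be illustrated with a small, self-contained sketch. It is not the CFCD-Net implementation: the `furniture` concept, all function names, and the use of a softmax to enforce a single choice among mutually exclusive words are assumptions made purely for illustration.

```python
import math
import random

# Hypothetical concept groups. Only the first two come from the abstract;
# "furniture" is an assumed extra group added for illustration.
CONCEPTS = {
    "animal": {"dog", "cat", "horse"},
    "vehicle": {"car", "truck", "SUV"},
    "furniture": {"chair", "table", "sofa"},
}


def unrelated_concepts(phrase, concepts=CONCEPTS):
    """Return the sorted names of concepts that share no word with the phrase."""
    words = set(phrase.lower().split())
    return sorted(
        name
        for name, members in concepts.items()
        if not words & {m.lower() for m in members}
    )


def sample_negative_concepts(phrase, k, rng=random):
    """Sample up to k unrelated concepts to serve as negatives for a region."""
    candidates = unrelated_concepts(phrase)
    return rng.sample(candidates, min(k, len(candidates)))


def select_exclusive(scores):
    """Softmax over scores of mutually exclusive words (e.g., colors).

    Returns the single highest-probability word and the full distribution,
    so that exactly one word is selected per region.
    """
    top = max(scores.values())
    # Subtract the maximum before exponentiating for numerical stability.
    exps = {word: math.exp(s - top) for word, s in scores.items()}
    total = sum(exps.values())
    probs = {word: e / total for word, e in exps.items()}
    best = max(probs, key=probs.get)
    return best, probs
```

For a region described as "a brown dog", `unrelated_concepts` would return the `furniture` and `vehicle` groups as negative candidates, and `select_exclusive({"red": 2.0, "blue": 0.5})` would pick `red` as the single color for the region.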