Phrase detection requires methods to identify whether a phrase is relevant to an image and, if so, to localize it. A key challenge in training more discriminative phrase detection models is sampling negatives. Sampling techniques from prior work focus primarily on hard, often noisy, negatives and disregard the broader distribution of negative samples. To address this problem, we introduce CFCD-Net, a phrase detector that differentiates between phrases through two novel methods. First, we group semantically similar words into sets we call concepts (e.g., {dog, cat, horse, ...} vs. {car, truck, ...}), and then train CFCD-Net to discriminate between a region of interest and its unrelated concepts. Second, for phrases containing fine-grained, mutually exclusive words (e.g., colors), we force the model to select only one applicable phrase for each region using our novel fine-grained module (FGM). We evaluate our approach on the Flickr30K Entities and RefCOCO+ datasets, where we improve mAP over the state of the art by 1.5-2 points. When considering only the phrases affected by our fine-grained reasoning module, we improve by 3-4 points on both datasets.
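As a rough illustration of the two ideas above, the following minimal sketch shows (a) picking the concepts unrelated to a phrase as candidate negatives and (b) forcing a single choice among mutually exclusive fine-grained words. It is not the CFCD-Net implementation: the abstract does not say how concepts are generated or how the FGM is built, so the hand-written `CONCEPTS` table, the function names, the word-overlap test, and the softmax-based selection are all assumptions made for illustration.

```python
import math

# Hypothetical concept groups, written by hand for illustration only.
# The abstract does not specify how the groups are actually generated.
CONCEPTS = {
    "animal": {"dog", "cat", "horse"},
    "vehicle": {"car", "truck"},
}


def unrelated_concepts(phrase, concepts=CONCEPTS):
    """Return names of concepts that share no word with the phrase.

    These are the candidate negatives a region labelled with `phrase`
    could be trained to score low against (assumed usage).
    """
    words = set(phrase.lower().split())
    return sorted(
        name for name, members in concepts.items() if not (words & members)
    )


def pick_exclusive(scores):
    """Choose exactly one word from a mutually exclusive set (e.g. colors).

    `scores` maps each word to a raw region-word score. A softmax turns the
    scores into probabilities that sum to one, and the top word is returned
    together with the full distribution. Softmax is an assumed stand-in for
    whatever the FGM does internally.
    """
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {word: math.exp(s - peak) for word, s in scores.items()}
    total = sum(exps.values())
    probs = {word: e / total for word, e in exps.items()}
    best = max(probs, key=probs.get)
    return best, probs


if __name__ == "__main__":
    print(unrelated_concepts("a brown dog"))
    print(pick_exclusive({"red": 2.0, "blue": 0.5, "green": -1.0})[0])
```

In this sketch a region described as "a brown dog" would be contrasted against the `vehicle` concept only, and a region with competing color scores ends up with a single selected color rather than several independently accepted ones.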