The matching model is essential for any Image-Text Retrieval framework. Existing research usually trains the model with a triplet loss and explores various strategies to retrieve hard negative sentences from the dataset. We argue that the current retrieval-based negative sample construction is limited by the scale of the dataset and thus fails to identify high-difficulty negative samples for every image. We propose TAiloring neGative Sentences with Discrimination and Correction (TAGS-DC), which automatically generates synthetic sentences as negative samples. TAGS-DC combines masking and refilling to generate synthetic negative sentences of higher difficulty. To maintain this difficulty during training, we mutually improve retrieval and generation through parameter sharing. To further exploit the fine-grained semantics of the mismatches in negative sentences, we propose two auxiliary tasks, word discrimination and word correction, to improve training. In experiments, we verify the effectiveness of our model on MS-COCO and Flickr30K against current state-of-the-art models, and demonstrate its robustness and faithfulness in further analysis. Our code is available at https://github.com/LibertFan/TAGS.
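The mask-and-refill idea can be sketched minimally as follows. This is an illustrative toy, not the paper's implementation: TAGS-DC refills masked positions with a learned masked language model shared with the retrieval model, whereas here a hypothetical distractor vocabulary stands in for the generator. The per-word mismatch labels mirror the supervision used by the word discrimination and correction tasks.

```python
import random

def tailor_negative(caption, distractors, mask_rate=0.3, seed=0):
    """Toy mask-and-refill: replace a fraction of the words in a
    matched caption with distractor words to form a synthetic hard
    negative. Returns the negative sentence and per-word labels
    (1 = refilled/mismatched, 0 = kept), the kind of fine-grained
    signal the word discrimination/correction tasks train on."""
    rng = random.Random(seed)
    words = caption.split()
    n_mask = max(1, int(len(words) * mask_rate))  # how many positions to mask
    positions = rng.sample(range(len(words)), n_mask)
    labels = [0] * len(words)
    for i in positions:
        # Refill each masked slot with a word that differs from the original;
        # the real model would sample from an MLM conditioned on context.
        words[i] = rng.choice([d for d in distractors if d != words[i]])
        labels[i] = 1
    return " ".join(words), labels

neg, labels = tailor_negative(
    "a brown dog runs on the beach",
    distractors=["cat", "sits", "grass", "white", "park"],
)
```

Because the negative keeps most of the original caption, it stays lexically close to the matched sentence and is harder to reject than a randomly retrieved caption.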