Phrase grounding models localize an object in an image given a referring expression. The annotated language queries available during training are limited, which in turn limits the variety of language combinations a model sees during training. In this paper, we study semi-supervised phrase grounding, in which objects without labeled queries are also used for training. We propose learned location and subject embedding predictors (LSEP) that generate the corresponding language embeddings for objects lacking annotated queries in the training set. With the help of a detector, we also apply LSEP to train a grounding model on images without any annotation. We evaluate our method, built on MAttNet, on three public datasets: RefCOCO, RefCOCO+, and RefCOCOg. We show that our predictors allow the grounding system to learn from objects without labeled queries, improving accuracy by a relative 34.9\% when detection results are used.
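
To make the "relative" in the reported figure concrete: a relative improvement is measured against the baseline accuracy, not in percentage points. The following is a minimal sketch; the helper name `relative_improvement` is hypothetical, and the numbers are illustrative only and are not results from the paper.

```python
def relative_improvement(baseline_acc: float, new_acc: float) -> float:
    """Return the gain of new_acc over baseline_acc as a fraction of the baseline."""
    if baseline_acc <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (new_acc - baseline_acc) / baseline_acc


# Illustrative values only (assumed, not taken from the paper).
gain = relative_improvement(0.40, 0.50)
print(f"relative gain: {gain:.1%}")
```

With these illustrative values, a 10-point absolute gain (40% to 50%) corresponds to a 25% relative gain, which is why relative and absolute figures should not be compared directly.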