Traditional semantic image search methods aim to retrieve images whose content matches the meaning of a text query. However, these methods typically match the query against the image as a whole, without localizing the relevant objects within it. This paper presents an extension of existing object detection models for semantic image search that scores the semantic alignment between object proposals and text queries, enabling search for objects within images. The proposed model uses a single feature extractor, a pre-trained Convolutional Neural Network, together with a transformer encoder for the text query. Proposal-text alignment is learned with a contrastive objective, producing a score for each proposal that reflects its semantic alignment with the query. A Region Proposal Network (RPN) generates the object proposals, and the whole model is trained end-to-end, yielding an efficient and effective solution for semantic image search that both retrieves images matching the query's meaning and produces semantically relevant object proposals.
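The abstract does not include code; as a rough illustration of the proposal-text alignment idea, the sketch below scores object proposals against a text query by cosine similarity in a shared embedding space and applies an InfoNCE-style contrastive loss. All function names, the temperature value, and the assumption that proposal and text features are already projected into a common space are hypothetical, not taken from the paper.

```python
import numpy as np

def alignment_scores(proposal_feats, text_emb):
    """Cosine similarity between each proposal embedding and the text embedding.

    proposal_feats: (num_proposals, d) array; text_emb: (d,) array.
    Both are assumed to already live in a shared embedding space.
    """
    p = proposal_feats / np.linalg.norm(proposal_feats, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb)
    return p @ t  # one alignment score per proposal

def contrastive_loss(proposal_feats, text_emb, positive_idx, temperature=0.07):
    """InfoNCE-style loss: the proposal matching the query (positive_idx)
    should receive the highest alignment score among all proposals."""
    logits = alignment_scores(proposal_feats, text_emb) / temperature
    log_probs = logits - np.log(np.sum(np.exp(logits)))  # log-softmax over proposals
    return -log_probs[positive_idx]

# Toy example: 3 proposals in a 4-d shared space; proposal 0 matches the query.
props = np.array([[1.0, 0.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0],
                  [0.9, 0.1, 0.0, 0.0]])
query = np.array([1.0, 0.0, 0.0, 0.0])

scores = alignment_scores(props, query)
loss = contrastive_loss(props, query, positive_idx=0)
```

In a full model the loss would be backpropagated through the shared CNN and transformer encoders; here plain NumPy is used only to make the scoring step concrete.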