Image-text retrieval, which associates different modalities, has drawn broad attention due to its excellent research value and wide range of real-world applications. Although algorithms are continually updated, most of them do not take the high-level semantic relationships ("style embedding") and the common knowledge shared across modalities into full consideration. To this end, we propose a novel style transformer network with common knowledge optimization (CKSTN) for image-text retrieval. Its main module is the common knowledge adaptor (CKA), which contains both a style embedding extractor (SEE) and a common knowledge optimization (CKO) module. Specifically, the SEE is designed to effectively extract high-level features, while the CKO module dynamically captures the latent concepts of common knowledge from the different modalities. Together, they assist in forming item representations in lightweight transformers. In addition, to obtain generalized temporal common knowledge, we propose a sequential update strategy that effectively integrates the features of different SEE layers with previous common feature units. CKSTN outperforms state-of-the-art methods for image-text retrieval on the MSCOCO and Flickr30K datasets. Moreover, thanks to its better performance and fewer parameters, CKSTN is more convenient and practical for deployment in real-world scenarios.