Title: The style transformer with common knowledge optimization for image-text retrieval

Wenrui Li,Zhengyu Ma,Jinqiao Shi,Xiaopeng Fan

Image-text retrieval, which associates different modalities, has drawn broad attention due to its research value and wide range of real-world applications. However, most existing methods have not fully considered high-level semantic relationships ("style embedding") or common knowledge across modalities. To this end, we introduce a novel style transformer network with common knowledge optimization (CKSTN) for image-text retrieval. Its main module is the common knowledge adaptor (CKA), which comprises the style embedding extractor (SEE) and the common knowledge optimization (CKO) module. Specifically, the SEE uses a sequential update strategy to effectively connect the features of its different stages. The CKO module is introduced to dynamically capture the latent concepts of common knowledge from different modalities. In addition, to obtain generalized temporal common knowledge, we propose a sequential update strategy that effectively integrates the features of different layers in SEE with previous common feature units. CKSTN outperforms state-of-the-art methods in image-text retrieval on the MSCOCO and Flickr30K datasets. Moreover, CKSTN is built on a lightweight transformer; with better performance and fewer parameters, it is more convenient and practical for real-world applications.
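For readers unfamiliar with the task, the following is a minimal sketch of how image-text retrieval is commonly scored: candidates are ranked by cosine similarity to the query embedding, and Recall@K counts how often the matching item lands in the top K. This is not the CKSTN model; the function names, the two-dimensional toy embeddings, and the choice of cosine similarity are assumptions made only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors (assumed non-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_candidates(query, candidates):
    """Return candidate indices ordered from most to least similar to the query."""
    scores = [cosine(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


def recall_at_k(queries, candidates, ground_truth, k):
    """Fraction of queries whose ground-truth candidate is ranked in the top k."""
    hits = 0
    for query, target in zip(queries, ground_truth):
        if target in rank_candidates(query, candidates)[:k]:
            hits += 1
    return hits / len(queries)


# Hypothetical toy embeddings: three images and three text queries,
# where text i is meant to match image i.
image_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_embeddings = [[0.9, 0.1], [0.2, 0.8], [0.9, 0.3]]
ground_truth = [0, 1, 2]

r_at_1 = recall_at_k(text_embeddings, image_embeddings, ground_truth, k=1)
r_at_2 = recall_at_k(text_embeddings, image_embeddings, ground_truth, k=2)
print(f"R@1 = {r_at_1:.3f}, R@2 = {r_at_2:.3f}")
```

In this toy setup the third text query ranks image 0 above its true match, so it counts as a miss at K=1 and a hit at K=2. A real system would replace the hand-written vectors with embeddings produced by the image and text encoders.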
