Many models have been proposed for vision-and-language tasks, especially the image-text retrieval task. State-of-the-art (SOTA) models for this task contain hundreds of millions of parameters and are pretrained on large external datasets, which has been shown to yield a substantial improvement in overall performance. It is not easy to propose a new model with a novel architecture and train it intensively on a massive dataset with many GPUs in order to surpass the many SOTA models that are already publicly available. In this paper, we propose a compact graph-based framework, named HADA, which combines pretrained models to produce a better result, rather than building a model from scratch. First, we create a graph structure in which the nodes are the features extracted from the pretrained models and the edges connect them. The graph structure is employed to capture and fuse the information from every pretrained model. A graph neural network is then applied to update the connections between the nodes to obtain a representative embedding vector for each image and text. Finally, we use cosine similarity to match images with their relevant texts and vice versa, which ensures a low inference time. Our experiments show that, although HADA contains only a tiny number of trainable parameters, it improves baseline performance by more than 3.6% in terms of evaluation metrics on the Flickr30k dataset. Moreover, the proposed model is not trained on any external dataset and, owing to its small number of parameters, requires only a single GPU to train. The source code is available at https://github.com/m2man/HADA.
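As a rough illustration of the general idea only (not HADA's actual architecture), the sketch below assumes two or more pretrained image-text models whose extracted features become nodes of a small fully connected graph; one hand-written message-passing step fuses them, and the fused embedding is L2-normalised so that retrieval reduces to cosine similarity. The class name SimpleGraphFusion and all dimensions are hypothetical choices for this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGraphFusion(nn.Module):
    """Minimal sketch: fuse feature vectors from several pretrained models
    with one graph message-passing step, then project to a joint embedding.
    Assumes at least two pretrained models (nodes)."""

    def __init__(self, dims, hidden=256, out_dim=256):
        super().__init__()
        # Project each pretrained model's feature into a common node space.
        self.node_proj = nn.ModuleList([nn.Linear(d, hidden) for d in dims])
        # One round of message passing over a dense graph of the nodes.
        self.msg = nn.Linear(hidden, hidden)
        self.update = nn.GRUCell(hidden, hidden)
        self.out = nn.Linear(hidden * len(dims), out_dim)

    def forward(self, feats):
        # feats: list of tensors, one per pretrained model, each (batch, d_i)
        nodes = [F.relu(p(f)) for p, f in zip(self.node_proj, feats)]
        nodes = torch.stack(nodes, dim=1)              # (batch, n_nodes, hidden)
        # Mean of messages from all other nodes (no self-loops).
        msgs = self.msg(nodes)
        agg = (msgs.sum(dim=1, keepdim=True) - msgs) / (nodes.size(1) - 1)
        b, n, h = nodes.shape
        updated = self.update(agg.reshape(b * n, h), nodes.reshape(b * n, h))
        updated = updated.reshape(b, n, h)
        # Concatenate the updated nodes into one embedding and L2-normalise,
        # so matching is a cosine similarity (plain dot product at inference).
        emb = self.out(updated.reshape(b, -1))
        return F.normalize(emb, dim=-1)

# Usage sketch: one fusion head per modality, trained jointly; the
# image-text score matrix is then image_emb @ text_emb.T
```

In this reading, only the small fusion heads are trained while the pretrained backbones stay frozen, which is consistent with the abstract's claim of a tiny number of trainable parameters and single-GPU training.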