Many models have been proposed for vision-and-language tasks, especially the image-text retrieval task. State-of-the-art (SOTA) models for this task contain hundreds of millions of parameters and are pretrained on large external datasets, which has been shown to yield a substantial improvement in overall performance. It is not easy to propose a new model with a novel architecture and train it intensively on a massive dataset with many GPUs in order to surpass the many SOTA models that are already publicly available. In this paper, we propose a compact graph-based framework, named HADA, which combines pretrained models to produce a better result, rather than building a model from scratch. First, we create a graph structure in which the nodes are the features extracted from the pretrained models and the edges connect them. The graph structure is employed to capture and fuse the information from every pretrained model. A graph neural network is then applied to update the connections between the nodes to obtain a representative embedding vector for each image and text. Finally, we use cosine similarity to match images with their relevant texts and vice versa, which ensures a low inference time. Our experiments show that, although HADA contains only a tiny number of trainable parameters, it improves baseline performance by more than 3.6% in terms of evaluation metrics on the Flickr30k dataset. Moreover, the proposed model is not trained on any external dataset and, owing to its small number of parameters, requires only a single GPU to train. The source code is available at https://github.com/m2man/HADA.
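As a rough illustration of the general idea only (not HADA's actual architecture), the sketch below assumes two or more pretrained image-text models whose extracted features become nodes of a small fully connected graph; one hand-written message-passing step fuses them, and the fused embedding is L2-normalised so that retrieval reduces to cosine similarity. The class name SimpleGraphFusion and all dimensions are hypothetical choices for this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGraphFusion(nn.Module):
    """Minimal sketch: fuse feature vectors from several pretrained models
    with one graph message-passing step, then project to a joint embedding.
    Assumes at least two pretrained models (nodes)."""

    def __init__(self, dims, hidden=256, out_dim=256):
        super().__init__()
        # Project each pretrained model's feature into a common node space.
        self.node_proj = nn.ModuleList([nn.Linear(d, hidden) for d in dims])
        # One round of message passing over a dense graph of the nodes.
        self.msg = nn.Linear(hidden, hidden)
        self.update = nn.GRUCell(hidden, hidden)
        self.out = nn.Linear(hidden * len(dims), out_dim)

    def forward(self, feats):
        # feats: list of tensors, one per pretrained model, each (batch, d_i)
        nodes = [F.relu(p(f)) for p, f in zip(self.node_proj, feats)]
        nodes = torch.stack(nodes, dim=1)              # (batch, n_nodes, hidden)
        # Mean of messages from all other nodes (no self-loops).
        msgs = self.msg(nodes)
        agg = (msgs.sum(dim=1, keepdim=True) - msgs) / (nodes.size(1) - 1)
        b, n, h = nodes.shape
        updated = self.update(agg.reshape(b * n, h), nodes.reshape(b * n, h))
        updated = updated.reshape(b, n, h)
        # Concatenate the updated nodes into one embedding and L2-normalise,
        # so matching is a cosine similarity (plain dot product at inference).
        emb = self.out(updated.reshape(b, -1))
        return F.normalize(emb, dim=-1)

# Usage sketch: one fusion head per modality, trained jointly; the
# image-text score matrix is then image_emb @ text_emb.T
```

In this reading, only the small fusion heads are trained while the pretrained backbones stay frozen, which is consistent with the abstract's claim of a tiny number of trainable parameters and single-GPU training.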