With the increasing accessibility of the web and of online encyclopedias, the amount of data to manage is constantly growing. Wikipedia, for example, contains millions of pages written in multiple languages. These pages include images that often lack textual context, remaining conceptually unanchored and therefore harder to find and manage. In this work, we present the system we designed for participating in the Wikipedia Image-Caption Matching challenge on Kaggle, whose objective is to use data associated with images (URLs and visual data) to find the correct caption among a large pool of available ones. A system able to perform this task would improve the accessibility and completeness of multimedia content on large online encyclopedias. Specifically, we propose a cascade of two models, both powered by the recent Transformer architecture, that efficiently and effectively infers a relevance score between the query image data and the captions. Through extensive experimentation, we verify that the proposed two-model approach is an effective way to handle a large pool of images and captions while keeping the overall computational complexity at inference time bounded. Our approach achieves remarkable results, obtaining a normalized Discounted Cumulative Gain (nDCG) of 0.53 on the private leaderboard of the Kaggle challenge.