Visual Question Answering (VQA) is a challenging task at the intersection of natural language processing (NLP) and computer vision (CV), and it has attracted significant attention from researchers. English is a resource-rich language that has seen extensive development of VQA datasets and models, whereas other languages still lack comparable resources and models. Moreover, no existing multilingual dataset targets the visual content of a particular country, with its own objects and cultural characteristics. To address this gap, we provide the research community with a benchmark dataset named EVJVQA, containing over 33,000 question-answer pairs in three languages (Vietnamese, English, and Japanese) over approximately 5,000 images taken in Vietnam, for evaluating multilingual VQA systems and models. EVJVQA was used as the benchmark dataset for the multilingual visual question answering challenge at the 9th Workshop on Vietnamese Language and Speech Processing (VLSP 2022). The task attracted 62 participating teams from various universities and organizations. In this article, we present the organization of the challenge, an overview of the methods employed by the shared-task participants, and the results. The highest performances on the private test set are 0.4392 in F1-score and 0.4009 in BLEU. The multilingual QA systems proposed by the top two teams use ViT as the pre-trained vision model and mT5, a powerful pre-trained language model based on the Transformer architecture, as the pre-trained language model. EVJVQA is a challenging dataset that motivates NLP and CV researchers to further explore multilingual models and systems for visual question answering.
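The abstract reports system quality in F1-score over generated answers. As a rough illustration of how such an answer-level F1 can be computed, the sketch below implements a token-level F1 between a predicted and a gold answer string; the whitespace tokenization and case-folding here are simplifying assumptions for illustration (a real scorer for Japanese or Vietnamese would need a proper tokenizer), not necessarily the exact scoring script of the shared task.

```python
from collections import Counter


def answer_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted answer and a gold answer.

    Assumes whitespace tokenization with case-folding; this is an
    illustrative simplification, not the official EVJVQA scorer.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # If either answer is empty, F1 is 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Count tokens shared between prediction and reference (with multiplicity).
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, `answer_f1("a red bicycle", "red bicycle")` gives precision 2/3 and recall 1, hence F1 = 0.8; a corpus-level score would average this over all question-answer pairs.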