Visual Question Answering (VQA) is a task that requires computers to give correct answers to input questions based on accompanying images. Humans solve this task with ease, but it remains a challenge for computers. The VLSP2022-EVJVQA shared task brings Visual Question Answering to the multilingual domain with a newly released dataset, UIT-EVJVQA, in which the questions and answers are written in three languages: English, Vietnamese, and Japanese. We approached the challenge as a sequence-to-sequence learning task, integrating hints from pre-trained state-of-the-art VQA models together with image features into a Convolutional Sequence-to-Sequence network to generate the desired answers. Our system achieved an F1 score of 0.3442 on the public test set and 0.4210 on the private test set, placing 3rd in the competition.
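To make the described pipeline concrete, below is a minimal sketch (not the authors' released code) of how hint tokens from a pretrained VQA model and a pooled image feature could be fused with a convolutional sequence-to-sequence encoder to generate an answer. All module names, dimensions, the fusion-by-concatenation scheme, and the GRU stand-in decoder are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvSeq2SeqEncoder(nn.Module):
    """GLU-gated convolutional encoder over (question + hint) token ids."""

    def __init__(self, vocab_size, emb_dim=256, hid_dim=256, n_layers=4, kernel=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.in_proj = nn.Linear(emb_dim, hid_dim)
        # Each conv emits 2*hid_dim channels so a GLU can gate them.
        self.convs = nn.ModuleList(
            [nn.Conv1d(hid_dim, 2 * hid_dim, kernel, padding=kernel // 2)
             for _ in range(n_layers)]
        )

    def forward(self, tokens):                        # tokens: (B, S)
        x = self.in_proj(self.embed(tokens))          # (B, S, H)
        x = x.transpose(1, 2)                         # (B, H, S)
        for conv in self.convs:
            x = F.glu(conv(x), dim=1) + x             # gated conv + residual
        return x.transpose(1, 2)                      # (B, S, H)


class HintedVQAModel(nn.Module):
    """Fuse a pooled image feature with encoded question+hints, then decode."""

    def __init__(self, vocab_size, img_feat_dim=2048, hid_dim=256):
        super().__init__()
        self.encoder = ConvSeq2SeqEncoder(vocab_size, hid_dim=hid_dim)
        self.img_proj = nn.Linear(img_feat_dim, hid_dim)
        self.dec_embed = nn.Embedding(vocab_size, hid_dim)
        # Stand-in decoder: a GRU replaces the convolutional decoder purely
        # to keep this sketch short; the fusion idea is unchanged.
        self.decoder = nn.GRU(hid_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, qh_tokens, img_feat, answer_in):
        enc = self.encoder(qh_tokens)                 # (B, S, H)
        img = self.img_proj(img_feat).unsqueeze(1)    # (B, 1, H)
        memory = torch.cat([img, enc], dim=1)         # (B, S+1, H)
        # Initialise the decoder from mean-pooled memory (one simple choice).
        h0 = memory.mean(dim=1).unsqueeze(0)          # (1, B, H)
        dec_out, _ = self.decoder(self.dec_embed(answer_in), h0)
        return self.out(dec_out)                      # (B, T, vocab)


# Toy usage with random tensors:
model = HintedVQAModel(vocab_size=8000)
qh = torch.randint(0, 8000, (2, 20))     # question ids + appended hint ids
img = torch.randn(2, 2048)               # e.g. a pooled CNN image feature
ans_in = torch.randint(0, 8000, (2, 7))  # shifted gold answer (teacher forcing)
logits = model(qh, img, ans_in)          # -> (2, 7, 8000)
```

Appending the pretrained model's hint as extra tokens keeps the whole problem in the sequence-to-sequence framing, so the same decoder can emit answers in any of the three languages.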