A Multiple Choices Reading Comprehension Corpus for Vietnamese Language Education

Machine reading comprehension, which aims to extract useful information from text, has been an interesting and challenging task in recent years. To help computers understand a reading passage and answer questions about it, we introduce ViMMRC 2.0, an extension of the earlier ViMMRC dataset for multiple-choice reading comprehension on Vietnamese textbooks, which contain reading articles for students from Grade 1 to Grade 12. The dataset has 699 reading passages, covering both prose and poems, and 5,273 questions. Unlike the previous version, questions in the new dataset are not fixed at four options. The questions are also more difficult, which makes it harder for models to find the correct choice: a model must understand the whole context of the reading passage, the question, and the content of each choice to select the right answer. We therefore propose a multi-stage approach that combines the multi-step attention network (MAN) with the natural language inference (NLI) task to improve the performance of the reading comprehension model. We compare the proposed method with baseline BERTology models on both the new dataset and ViMMRC 1.0. Our multi-stage models achieve 58.81% accuracy on the test set, 5.34% higher than the best BERTology model. Error analysis shows that the main challenge for reading comprehension models is understanding implicit context in the text and linking pieces of it together to find the correct answer. Finally, we hope the new dataset will motivate further research on improving computers' language understanding in Vietnamese.
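As a rough illustration of the task setup described above, the sketch below shows how a question with a variable number of options might be represented and how accuracy could be computed over a set of questions. It is not the MAN or NLI model from the paper: the `Question` fields, the `overlap_score` word-overlap scorer, and the toy English examples are all hypothetical stand-ins chosen only to make the evaluation loop concrete.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical record layout; the actual ViMMRC 2.0 file format may differ.
@dataclass
class Question:
    passage: str
    question: str
    options: List[str]  # length varies from question to question
    answer: int         # index of the correct option


# A scorer maps (passage, question, option) to a plausibility score.
Scorer = Callable[[str, str, str], float]


def predict(q: Question, score: Scorer) -> int:
    """Return the index of the highest-scoring option (first one on ties)."""
    scores = [score(q.passage, q.question, opt) for opt in q.options]
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(questions: List[Question], score: Scorer) -> float:
    """Fraction of questions whose predicted option matches the answer."""
    if not questions:
        return 0.0
    correct = sum(predict(q, score) == q.answer for q in questions)
    return correct / len(questions)


def overlap_score(passage: str, question: str, option: str) -> float:
    """Toy stand-in for a real model: share of option words found in the passage."""
    passage_words = set(passage.lower().split())
    option_words = option.lower().split()
    hits = sum(w in passage_words for w in option_words)
    return hits / max(len(option_words), 1)


# Made-up examples with three and two options respectively.
toy_questions = [
    Question(
        passage="The cat sat on the mat",
        question="Where did the cat sit?",
        options=["on the mat", "in the garden", "under a tree"],
        answer=0,
    ),
    Question(
        passage="Lan reads a book every evening",
        question="What does Lan read?",
        options=["a newspaper", "a book"],
        answer=1,
    ),
]

if __name__ == "__main__":
    print(accuracy(toy_questions, overlap_score))
```

Because `predict` iterates over however many options a question has, the same loop handles both four-option and variable-option questions; a real system would replace `overlap_score` with a trained model's score for each option.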
