Recent advances in the field of language modeling have improved the state of the art in question answering (QA) and question generation (QG). However, the development of modern neural models, their benchmarks, and the datasets for training them has mainly focused on English. Finnish, like many other languages, faces a shortage of large-scale QA/QG training resources, which has prevented experimentation with state-of-the-art QA/QG fine-tuning methods. We present the first neural QA and QG models that work with Finnish. To train the models, we automatically translate the SQuAD dataset and then apply normalization methods to reduce the amount of problematic data created during translation. Using the synthetic data, together with the Finnish partition of the TyDi-QA dataset, we fine-tune several transformer-based models for both QA and QG and evaluate their performance. To the best of our knowledge, the resulting dataset is the first large-scale QA/QG resource for Finnish. This paper also sets the initial benchmarks for Finnish-language QA and QG.