Powerful generative models have led to recent progress in question generation (QG). However, it is difficult to measure advances in QG research since there are no standardized resources that allow a uniform comparison among approaches. In this paper, we introduce QG-Bench, a multilingual and multidomain benchmark for QG that unifies existing question answering datasets by converting them to a standard QG setting. It includes general-purpose datasets such as SQuAD for English, datasets from ten domains and two styles, as well as datasets in eight different languages. Using QG-Bench as a reference, we perform an extensive analysis of the capabilities of language models for the task. First, we propose robust QG baselines based on fine-tuning generative language models. Then, we complement automatic evaluation based on standard metrics with an extensive manual evaluation, which in turn sheds light on the difficulty of evaluating QG models. Finally, we analyse both the domain adaptability of these models and the effectiveness of multilingual models in languages other than English. QG-Bench is released, together with the fine-tuned models presented in the paper, at https://github.com/asahi417/lm-question-generation; the models are also available through a demo at https://autoqg.net/.
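To make the conversion of question answering datasets into a QG setting concrete, the following is a minimal sketch of how a SQuAD-style QA record could be turned into an input/output pair for a generative QG model. The `<hl>` highlight tokens, the `generate question:` prefix, and the field names are illustrative assumptions, not necessarily the exact format used by QG-Bench.

```python
# Minimal sketch: convert a SQuAD-style QA record into a question-generation
# (QG) example. The "<hl>" marker and "generate question:" prefix are
# assumptions for illustration, not the definitive QG-Bench format.

HL = "<hl>"  # hypothetical token wrapping the answer span in the context


def qa_to_qg(example: dict) -> dict:
    """Turn {context, question, answers} into a (model input, target) pair."""
    context = example["context"]
    answer = example["answers"]["text"][0]
    start = example["answers"]["answer_start"][0]
    end = start + len(answer)

    # Highlight the answer span so the model knows which question to generate.
    highlighted = f"{context[:start]}{HL} {answer} {HL}{context[end:]}"
    return {
        "input_text": f"generate question: {highlighted}",
        "target_text": example["question"],
    }


if __name__ == "__main__":
    sample = {
        "context": "QG-Bench unifies question answering datasets for question generation.",
        "question": "What does QG-Bench unify?",
        "answers": {"text": ["question answering datasets"], "answer_start": [17]},
    }
    print(qa_to_qg(sample))
```

The resulting (input, target) pairs can then be used to fine-tune a generative language model with any standard sequence-to-sequence training setup.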