Question answering (QA) is a natural language processing task that has seen considerable growth since the advent of transformers. A surge of QA datasets has been proposed to challenge natural language processing models and to push performance past existing human and model baselines. Many pre-trained language models have proven highly effective at extractive question answering. However, generalizability remains a challenge for most of these models; that is, some datasets demand more reasoning from a model than others. In this paper, we fine-tune several pre-trained language models on multiple question answering datasets of varying difficulty to determine which models generalize most comprehensively across datasets. We further propose a new architecture, BERT-BiLSTM, and compare it with other language models to determine whether adding more bidirectionality improves performance. Using the F1 score as our metric, we find that the pre-trained RoBERTa and BART models perform best across all datasets and that our BERT-BiLSTM model outperforms the baseline BERT model.
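A minimal sketch of the BERT-BiLSTM idea described above, assuming a PyTorch/HuggingFace setup: BERT encodes the question-context pair, a BiLSTM adds a second bidirectional pass over the token representations, and a linear head produces the start/end logits used in extractive QA. The class name, the `lstm_hidden` size, and the checkpoint are illustrative assumptions, not the paper's exact configuration.

```python
# Illustrative sketch, not the paper's exact implementation.
import torch.nn as nn
from transformers import BertModel

class BertBiLSTMForQA(nn.Module):
    def __init__(self, model_name="bert-base-uncased", lstm_hidden=384):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        hidden = self.bert.config.hidden_size
        # The BiLSTM adds an extra layer of bidirectional context
        # on top of BERT's token representations.
        self.bilstm = nn.LSTM(hidden, lstm_hidden,
                              batch_first=True, bidirectional=True)
        # Two logits per token: answer-span start and end (extractive QA).
        self.qa_outputs = nn.Linear(2 * lstm_hidden, 2)

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        hidden_states = self.bert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
        ).last_hidden_state
        lstm_out, _ = self.bilstm(hidden_states)
        start_logits, end_logits = self.qa_outputs(lstm_out).split(1, dim=-1)
        return start_logits.squeeze(-1), end_logits.squeeze(-1)
```

Trained with the usual cross-entropy loss over start and end positions, this head is a drop-in replacement for the standard linear QA head, which is what makes the comparison against baseline BERT direct.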