In recent years, pretrained neural language models (PNLMs) have taken the field of natural language processing by storm, setting new benchmarks and achieving state-of-the-art performance. These models often rely heavily on annotated data, which may not always be available. Data scarcity is common in specialized domains, such as medicine, and in low-resource languages that remain underexplored by AI research. In this dissertation, we focus on mitigating data scarcity using data augmentation and neural ensemble learning techniques for neural language models. In both research directions, we implement neural network algorithms and evaluate their impact on assisting neural language models in downstream NLP tasks. Specifically, for data augmentation, we explore two techniques: 1) creating positive training data by moving an answer span within its original context, and 2) using text simplification to introduce a variety of writing styles into the original training data. Our results indicate that these simple yet effective solutions considerably improve the performance of neural language models in low-resource NLP domains and tasks. For neural ensemble learning, we use a multilabel neural classifier to select the best prediction from a set of individual pretrained neural language models trained for a low-resource medical text simplification task.
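To make the first augmentation technique concrete, the sketch below shows one way an answer span could be relocated within its context to produce an additional positive training example for extractive question answering. It is a minimal sketch under assumed conventions: the function name, character-offset fields, and the sentence-boundary heuristic are illustrative and are not the dissertation's actual implementation.

```python
# Hypothetical sketch of answer-span relocation for extractive QA data
# augmentation: the gold answer is cut from its original position and
# re-inserted at another sentence boundary, yielding a new positive
# example with an updated character offset. Names and heuristics here
# are assumptions for illustration only.
import random


def relocate_answer(context: str, answer: str, answer_start: int, seed: int = 0):
    """Return a new (context, answer_start) pair with the answer span moved."""
    rng = random.Random(seed)

    # Remove the answer span from its original location.
    stripped = context[:answer_start] + context[answer_start + len(answer):]

    # Candidate insertion points: positions right after sentence-ending periods.
    candidates = [i + 2 for i, ch in enumerate(stripped[:-2]) if ch == "."]
    if not candidates:
        return context, answer_start  # single-sentence context: leave unchanged

    new_start = rng.choice(candidates)
    augmented = stripped[:new_start] + answer + " " + stripped[new_start:]
    return augmented, new_start


if __name__ == "__main__":
    ctx = "Marie Curie was born in Warsaw. She won two Nobel Prizes. She studied physics."
    ans = "Warsaw"
    new_ctx, new_start = relocate_answer(ctx, ans, ctx.index(ans), seed=1)
    # The relocated span still matches the gold answer at its new offset.
    assert new_ctx[new_start:new_start + len(ans)] == ans
    print(new_ctx)
```

In this reading, the augmented example keeps the same question and answer string while the supervision signal (the answer's character offset) shifts, which is one plausible way to expose the model to the answer in varied positional contexts.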