Domain adaptation for large neural language models (NLMs) typically relies on massive amounts of unstructured data during the pretraining phase. In this study, however, we show that pretrained NLMs learn in-domain information more effectively and faster from a compact subset of the data that focuses on the key information in the domain. We construct these compact subsets from the unstructured data using a combination of abstractive summaries and extractive keywords. In particular, we rely on BART to generate abstractive summaries, and KeyBERT to extract keywords from these summaries (or the original unstructured text directly). We evaluate our approach using six different settings: three datasets combined with two distinct NLMs. Our results reveal that the task-specific classifiers trained on top of NLMs pretrained using our method outperform methods based on traditional pretraining, i.e., random masking on the entire data, as well as methods without pretraining. Further, we show that our strategy reduces pretraining time by up to a factor of five compared to vanilla pretraining. The code for all of our experiments is publicly available at https://github.com/shahriargolchin/compact-pretraining.