Contextual embedding-based language models trained on large data sets, such as BERT and RoBERTa, provide strong performance across a wide range of tasks and are ubiquitous in modern NLP. It has been observed that fine-tuning these models on tasks involving data from domains different from that on which they were pretrained can lead to suboptimal performance. Recent work has explored approaches to adapt pretrained language models to new domains by incorporating additional pretraining on domain-specific corpora and task data. We propose an alternative approach for transferring pretrained language models to new domains by adapting their tokenizers. We show that domain-specific subword sequences can be determined efficiently and directly from divergences in the conditional token distributions of the base and domain-specific corpora. On datasets from four disparate domains, we find that adaptive tokenization on a pretrained RoBERTa model provides >97% of the performance benefits of domain-specific pretraining. Our approach produces smaller models and requires less training and inference time than other approaches using tokenizer augmentation. While adaptive tokenization incurs a 6% increase in model parameters in our experimentation, due to the introduction of 10k new domain-specific tokens, our approach, using 64 vCPUs, is 72x faster than further pretraining the language model on domain-specific corpora on 8 TPUs.
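The following is a minimal sketch of the kind of divergence-based selection of domain-specific subword sequences described above. The n-gram candidate generation, the smoothing, the pointwise KL-style scoring, and all function names here are illustrative assumptions, not the paper's exact procedure.

```python
# Illustrative sketch (not the paper's exact algorithm): score candidate subword
# sequences by how much their empirical probability diverges between a
# domain-specific corpus and the base pretraining corpus, then take the
# top-scoring sequences as new tokens for the adapted tokenizer.
import math
from collections import Counter
from typing import Iterable

def sequence_counts(tokenized_docs: Iterable[list[str]], max_len: int = 3) -> Counter:
    """Count contiguous subword sequences (n-grams over subword tokens)."""
    counts = Counter()
    for tokens in tokenized_docs:
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts

def divergence_scores(domain_counts: Counter, base_counts: Counter,
                      smoothing: float = 1.0) -> dict:
    """Score each candidate by the log-ratio of its smoothed relative frequency
    in the domain corpus versus the base corpus, weighted by its domain
    probability (a pointwise KL-style contribution)."""
    domain_total = sum(domain_counts.values())
    base_total = sum(base_counts.values())
    scores = {}
    for seq, count in domain_counts.items():
        p_domain = (count + smoothing) / (domain_total + smoothing)
        p_base = (base_counts.get(seq, 0) + smoothing) / (base_total + smoothing)
        scores[seq] = p_domain * math.log(p_domain / p_base)
    return scores

def select_new_tokens(domain_docs, base_docs, k: int = 10_000) -> list[str]:
    """Return the k highest-scoring multi-subword sequences as candidate new tokens."""
    scores = divergence_scores(sequence_counts(domain_docs), sequence_counts(base_docs))
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ["".join(seq) for seq, _ in ranked if len(seq) > 1][:k]
```

In practice the selected sequences would be appended to the pretrained tokenizer's vocabulary and their embeddings initialized before fine-tuning; the sketch covers only the selection step.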