Vocabulary transfer is a transfer learning subtask in which a language model is fine-tuned with a corpus-specific tokenization instead of the default one used during pretraining. This usually improves the resulting performance of the model, and in this paper we demonstrate that vocabulary transfer is especially beneficial for medical text processing. Using three different medical natural language processing datasets, we show that vocabulary transfer provides up to ten extra percentage points of downstream classification accuracy.
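The following is a minimal sketch of the general vocabulary transfer recipe described above, assuming Hugging Face Transformers with a fast tokenizer; the base checkpoint, the tiny in-memory corpus, the vocabulary size, and the mean-of-subwords initialization heuristic are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch: fine-tuning with a corpus-specific tokenizer instead of the pretraining one.
# Library choice and embedding-initialization heuristic are assumptions for illustration.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

base = "bert-base-uncased"  # hypothetical base checkpoint
old_tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=2)

# 1. Train a corpus-specific tokenizer on the downstream (here: medical) texts.
medical_corpus = [
    "patient presented with acute myocardial infarction",
    "chest radiograph showed bilateral pulmonary infiltrates",
]  # placeholder corpus
new_tok = old_tok.train_new_from_iterator(medical_corpus, vocab_size=30522)

# 2. Build a new embedding matrix: reuse old embeddings for tokens shared with the
#    old vocabulary; initialize new tokens from the mean of the old sub-token embeddings
#    (a rough heuristic; other initialization schemes are possible).
old_vocab = old_tok.get_vocab()
old_emb = model.get_input_embeddings().weight.data
new_emb = torch.empty(len(new_tok), old_emb.size(1))
for token, new_id in new_tok.get_vocab().items():
    old_id = old_vocab.get(token)
    if old_id is not None:
        new_emb[new_id] = old_emb[old_id]
    else:
        pieces = old_tok(token, add_special_tokens=False)["input_ids"]
        new_emb[new_id] = old_emb[pieces].mean(dim=0) if pieces else old_emb.mean(dim=0)

model.resize_token_embeddings(len(new_tok))
model.get_input_embeddings().weight.data.copy_(new_emb)

# 3. Fine-tune `model` with `new_tok` on the downstream classification task as usual.
```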