Pre-trained models are widely used in natural language processing tasks. However, in the specific field of text simplification, research on improving pre-trained models remains largely unexplored. In this work, we propose a continued pre-training method for text simplification. Specifically, we propose a new masked language modeling (MLM) mechanism that does not mask words at random but masks only simple words, so that the model learns to generate simple words. We use a small-scale simple-text dataset for continued pre-training and employ two methods to identify simple words in the texts. We choose BERT, a representative pre-trained model, and continue pre-training it with our proposed method. The resulting model, SimpleBERT, surpasses BERT on both lexical simplification and sentence simplification tasks and achieves state-of-the-art results on multiple datasets. Moreover, SimpleBERT can replace BERT in existing simplification models without modification.
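The core idea of the masking mechanism can be sketched as follows. This is a minimal illustration, not the authors' released code: the `simple_words` set, the `mask_prob` value, and the use of the Hugging Face `transformers` BERT classes are all assumptions made for the example; the paper's two simple-word identification methods are not reproduced here.

```python
# Minimal sketch of simple-word-only masking for continued MLM pre-training.
# Assumes a hypothetical simple-word list; the paper's identification methods
# (two of them) are not implemented here.
import random
import torch
from transformers import BertTokenizerFast, BertForMaskedLM

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Hypothetical simple-word vocabulary (placeholder for the paper's methods).
simple_words = {"make", "use", "small", "easy", "help"}

def mask_simple_words(sentence, mask_prob=0.15):
    """Mask only tokens whose surface form is in `simple_words`.

    Unlike standard MLM, non-simple words are never masked, so the
    prediction targets are exclusively simple words.
    """
    enc = tokenizer(sentence, return_tensors="pt")
    input_ids = enc["input_ids"].clone()
    labels = torch.full_like(input_ids, -100)  # -100 is ignored by the MLM loss
    tokens = tokenizer.convert_ids_to_tokens(input_ids[0].tolist())
    for i, tok in enumerate(tokens):
        # Note: WordPiece sub-tokens of longer words will not match the set,
        # which is acceptable for this sketch.
        if tok in simple_words and random.random() < mask_prob:
            labels[0, i] = input_ids[0, i]
            input_ids[0, i] = tokenizer.mask_token_id
    return input_ids, enc["attention_mask"], labels

# One toy continued pre-training step.
input_ids, attention_mask, labels = mask_simple_words("They use a small and easy method.")
loss = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels).loss
loss.backward()
```

Because the model and tokenizer interfaces are unchanged, a checkpoint trained this way can be dropped into existing BERT-based simplification pipelines without modification, which is the property the abstract highlights.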