BERT is a widely used pre-trained model in natural language processing. However, because its time and space requirements grow quadratically with the text length, the BERT model is difficult to apply directly to long-text corpora. In some fields, such as health care, the collected text data are usually quite long. Therefore, to apply the pre-trained language knowledge of BERT to long text, this paper proposes the Skimming-Intensive Model (SkIn), which imitates the skimming-intensive reading strategy that humans use when reading a long passage. SkIn dynamically selects the critical information in the text, so the length of the input to the BERT-Base model is significantly reduced, which effectively lowers the cost of the classification algorithm. Experiments show that SkIn achieves better results than the baselines on long-text classification datasets in the medical field, while its time and space requirements grow only linearly with the text length, alleviating the time and space overflow problems of BERT on long-text data.
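To make the skim-then-read idea concrete, the sketch below shows one way such a pipeline could be wired up. It is a minimal illustration, not the paper's actual SkIn mechanism: the skim pass here uses a simple TF-IDF heuristic as a stand-in for SkIn's learned selection of critical information, and the segment count `NUM_SELECTED`, the sentence-splitting rule, and the mock clinical note are all hypothetical choices made only for this example. The intensive pass then feeds the selected segments, which fit within BERT-Base's 512-token limit, to a standard Hugging Face classifier.

```python
# Minimal sketch of a skim-then-intensive-read pipeline (NOT the paper's exact method):
# a cheap skim scorer keeps the most informative segments, and only those are
# passed to BERT-Base, so the expensive pass sees a bounded input length.
import numpy as np
import torch
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MAX_LEN = 512        # BERT-Base input limit (tokens)
NUM_SELECTED = 8     # hypothetical number of segments kept by the skim pass


def skim_select(segments, num_selected=NUM_SELECTED):
    """Skim pass: cheaply score each segment and keep the most informative ones.
    The score is the segment's mean TF-IDF weight -- a stand-in heuristic,
    not SkIn's actual selection criterion."""
    tfidf = TfidfVectorizer().fit_transform(segments)           # (n_segments, vocab)
    scores = np.asarray(tfidf.mean(axis=1)).ravel()             # one score per segment
    keep = np.sort(np.argsort(scores)[::-1][:num_selected])     # top segments, original order
    return [segments[i] for i in keep]


def intensive_read(segments, tokenizer, model):
    """Intensive pass: concatenate the selected segments and classify with BERT-Base."""
    text = " ".join(segments)
    inputs = tokenizer(text, truncation=True, max_length=MAX_LEN, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(logits.argmax(dim=-1))


if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2
    )

    # Mock long clinical note, far beyond the 512-token limit if fed directly.
    long_document = (
        "The patient reports persistent cough and fever. "
        "Past history includes type 2 diabetes. "
        "Chest X-ray shows bilateral infiltrates. "
    ) * 50
    segments = [s.strip() for s in long_document.split(".") if s.strip()]  # naive split

    selected = skim_select(segments)
    label = intensive_read(selected, tokenizer, model)
    print("predicted class:", label)
```

Because the skim pass touches each segment only once and the intensive pass always sees at most `NUM_SELECTED` segments, the overall cost of this sketch grows linearly with the document length, which mirrors the linear time and space behavior claimed for SkIn above.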