Hierarchical Neural Network Approaches for Long Document Classification

Text classification algorithms investigate the intricate relationships between words or phrases and attempt to deduce a document's interpretation. These algorithms have progressed tremendously in the last few years, and the Transformer architecture and sentence encoders have given superior results on natural language processing tasks. A major limitation of these architectures, however, is that they are applicable only to texts of at most a few hundred words. In this paper, we explore hierarchical transfer learning approaches for long document classification. We employ the pre-trained Universal Sentence Encoder (USE) and Bidirectional Encoder Representations from Transformers (BERT) in a hierarchical setup to capture better representations efficiently. Our proposed models are conceptually simple: we divide the input data into chunks and pass each chunk through the base BERT or USE model. The output representation for each chunk is then propagated through a shallow neural network comprising LSTMs or CNNs to classify the text. These extensions are evaluated on six benchmark datasets. We show that USE + CNN/LSTM performs better than its stand-alone baseline, whereas BERT + CNN/LSTM performs on par with its stand-alone counterpart. The hierarchical BERT models are still desirable, however, as they avoid the quadratic complexity of the attention mechanism in BERT. Along with the hierarchical approaches, this work also provides a comparison of different deep learning algorithms, namely USE, BERT, HAN, Longformer, and BigBird, for long document classification. The Longformer approach consistently performs well on most of the datasets.
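
The chunking step and the attention-cost argument above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the chunk size, whether chunks overlap, or how text is tokenized, so `chunk_size`, `overlap`, and the helper names below are hypothetical. The encoder (BERT or USE) and the LSTM/CNN classifier are deliberately left out; the sketch only splits a token sequence and counts the token pairs that full self-attention would score.

```python
from typing import List


def split_into_chunks(tokens: List[str], chunk_size: int, overlap: int = 0) -> List[List[str]]:
    """Split a token list into consecutive chunks of at most chunk_size tokens.

    chunk_size and overlap are illustrative parameters, not values from the paper.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a chunk has reached the end of the document.
        if start + chunk_size >= len(tokens):
            break
    return chunks


def attention_pair_count(num_tokens: int) -> int:
    """Token pairs scored by full self-attention over one sequence (n * n)."""
    return num_tokens * num_tokens


def chunked_attention_pair_count(chunks: List[List[str]]) -> int:
    """Token pairs scored when each chunk is attended to independently."""
    return sum(attention_pair_count(len(chunk)) for chunk in chunks)


if __name__ == "__main__":
    # A stand-in "document" of 2,048 placeholder tokens.
    document = [f"tok{i}" for i in range(2048)]
    chunks = split_into_chunks(document, chunk_size=512)
    print(len(chunks))
    print(attention_pair_count(len(document)))
    print(chunked_attention_pair_count(chunks))
```

Under these assumed settings, a 2,048-token document yields four 512-token chunks. Full attention over the whole document would score 4,194,304 token pairs, while attending within each chunk scores 1,048,576 in total, so the cost grows linearly with document length once the chunk size is fixed. In the hierarchical setup described above, each chunk would then be encoded separately and the resulting sequence of chunk representations passed to the shallow LSTM or CNN classifier.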
