The recent literature on text classification is biased towards short text sequences (e.g., sentences or paragraphs). In real-world applications, multi-page, multi-paragraph documents are common, and they cannot be efficiently encoded by vanilla Transformer-based models. We compare different Transformer-based Long Document Classification (TrLDC) approaches that aim to mitigate the computational overhead of vanilla Transformers when encoding much longer text, namely sparse-attention and hierarchical encoding methods. We examine several aspects of sparse-attention (e.g., size of the local attention window, use of global attention) and hierarchical (e.g., document splitting strategy) Transformers on four document classification datasets covering different domains. We observe a clear benefit from being able to process longer text, and, based on our results, we derive practical advice on applying Transformer-based models to long document classification tasks.
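To make the hierarchical encoding idea concrete, the following is a minimal sketch (not the paper's exact implementation): a long document is split into fixed-size chunks, each chunk is encoded independently by a pre-trained Transformer, and the chunk representations are aggregated into a single document vector that a classifier could consume. The backbone name, chunk length, and mean-pooling aggregation are illustrative assumptions, not choices prescribed by the paper.

```python
# Hierarchical encoding sketch: chunk -> encode -> aggregate.
# Assumptions (not from the paper): bert-base-uncased backbone,
# 510-token chunks, [CLS] chunk vectors, mean pooling over chunks.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-uncased"   # assumed backbone
CHUNK_LEN = 510                    # leave room for [CLS] and [SEP]

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)

def encode_long_document(text: str) -> torch.Tensor:
    """Return one document embedding by chunking and mean pooling."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    chunks = [ids[i:i + CHUNK_LEN] for i in range(0, len(ids), CHUNK_LEN)]
    chunk_vectors = []
    with torch.no_grad():
        for chunk in chunks:
            # Re-add special tokens so each chunk looks like a normal input.
            input_ids = torch.tensor(
                [tokenizer.build_inputs_with_special_tokens(chunk)]
            )
            output = encoder(input_ids=input_ids)
            # Use the [CLS] vector as the chunk representation.
            chunk_vectors.append(output.last_hidden_state[:, 0])
    # Aggregate chunk representations into a single document vector.
    return torch.cat(chunk_vectors, dim=0).mean(dim=0)

doc_vector = encode_long_document("A very long multi-page document ... " * 200)
print(doc_vector.shape)  # torch.Size([768])
```

A sparse-attention model such as Longformer instead processes the whole sequence at once with a limited local attention window plus a few global tokens; the trade-offs between these two families are what the paper's experiments compare.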