Comparative Study of Long Document Classification

The amount of information stored as documents on the internet has been increasing rapidly, so organizing and maintaining these documents efficiently has become a necessity. Text classification algorithms model the complex relationships between words in a text and try to interpret the semantics of the document. These algorithms have evolved significantly in the past few years, progressing from simple machine learning algorithms to transformer-based architectures. However, existing literature has analyzed different approaches on different datasets, which makes it difficult to compare the performance of these algorithms. In this work, we revisit long document classification using standard machine learning approaches. We benchmark approaches ranging from simple Naive Bayes to complex BERT on six standard text classification datasets and present an exhaustive comparison of the algorithms across this range of long document datasets. We reiterate that long document classification is a comparatively simple task: even basic algorithms perform competitively with BERT-based approaches on most of the datasets. The BERT-based models perform consistently well on all the datasets and can be used as a default choice for document classification when computational cost is not a concern. Among shallow models, we suggest the raw BiLSTM + Max architecture, which performs reasonably well across all the datasets. An even simpler GloVe + Attention bag-of-words model can be used for simpler use cases. The value of sophisticated models is clearly visible on the IMDB sentiment dataset, which is a comparatively harder task.
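To make the simple end of the benchmarked range concrete, the following is a minimal sketch of a multinomial Naive Bayes classifier over bag-of-words counts. It is not the authors' implementation: the class name `NaiveBayes`, the whitespace tokenizer, the smoothing value `alpha=1.0`, and the toy documents are all assumptions made for illustration, and none of the data comes from the six datasets used in the paper.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercased whitespace tokenization (an assumption; real pipelines do more).
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes over bag-of-words counts (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing strength (assumed value)

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(tokenize(doc))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in tokenize(doc):
                if word in self.vocab:  # skip words never seen in training
                    score += math.log((counts[word] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data invented for this example; not drawn from the paper's datasets.
train_docs = [
    "the match ended with a late goal",
    "the team won the league title",
    "the court ruled on the appeal",
    "the judge dismissed the lawsuit",
]
train_labels = ["sports", "sports", "law", "law"]

model = NaiveBayes().fit(train_docs, train_labels)
prediction = model.predict("the judge heard the appeal")
```

Working in log space avoids numerical underflow on long documents, where the product of many small word probabilities would otherwise round to zero.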
