Text Guide: Improving the quality of long text classification by a text selection method based on feature importance

From arXiv: This is the reviewed version of the article, accepted for publication by the IEEE Access journal. One of the important modifications is the publication of the code along with the paper. The code can be used to apply Text Guide to a data set of one's choice. The code is available at: https://github.com/krzysztoffiok/TextGuide

The performance of text classification methods has improved greatly over the last decade for text instances of less than 512 tokens. This limit has been adopted by most state-of-the-art transformer models due to the high computational cost of analyzing longer text instances. To mitigate this problem and to improve classification for longer texts, researchers have sought to resolve the underlying causes of the computational cost and have proposed optimizations for the attention mechanism, which is the key element of every transformer model. In our study, we are not pursuing the ultimate goal of long text classification, i.e., the ability to analyze entire text instances at once while preserving high performance at a reasonable computational cost. Instead, we propose a text truncation method called Text Guide, in which the original text length is reduced to a predefined limit in a manner that improves performance over naive and semi-naive approaches while preserving low computational costs. Text Guide benefits from the concept of feature importance, a notion from the explainable artificial intelligence domain. We demonstrate that Text Guide can be used to improve the performance of recent language models specifically designed for long text classification, such as Longformer. Moreover, we discovered that parameter optimization is the key to Text Guide performance and must be conducted before the method is deployed. Future experiments may reveal additional benefits provided by this new method.
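To make the contrast between naive truncation and importance-guided selection concrete, here is a minimal Python sketch. It is not the authors' implementation (that lives in the repository linked above). The function names, the `head` and `window` parameters, and the toy importance scores are all hypothetical, and the scores are assumed to come from some feature importance method applied beforehand.

```python
from typing import Dict, List


def truncate_head(tokens: List[str], limit: int) -> List[str]:
    """Naive baseline: keep only the first `limit` tokens."""
    return tokens[:limit]


def truncate_by_importance(
    tokens: List[str],
    importance: Dict[str, float],
    limit: int,
    head: int = 2,
    window: int = 1,
) -> List[str]:
    """Hypothetical importance-guided selection (illustrative only).

    Keeps the first `head` tokens, then adds each important token plus
    up to `window` neighbours on either side, most important first,
    until `limit` tokens are selected. Original order is preserved.
    """
    if len(tokens) <= limit:
        return list(tokens)

    keep = set(range(min(head, limit)))

    # Positions of tokens with a positive score, highest score first.
    candidates = sorted(
        (i for i, tok in enumerate(tokens) if importance.get(tok, 0.0) > 0.0),
        key=lambda i: (-importance[tokens[i]], i),
    )

    for i in candidates:
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        # Visit the important token itself first, then its neighbours.
        for j in sorted(range(lo, hi), key=lambda j: abs(j - i)):
            if len(keep) >= limit:
                break
            keep.add(j)
        if len(keep) >= limit:
            break

    # If budget remains, fall back to the earliest unused tokens.
    for j in range(len(tokens)):
        if len(keep) >= limit:
            break
        keep.add(j)

    return [tokens[j] for j in sorted(keep)]


tokens = (
    "the report opens with general remarks and later "
    "states the refund was denied unfairly"
).split()
importance = {"refund": 0.9, "denied": 0.7}  # made-up scores

print(" ".join(truncate_head(tokens, 6)))
# the report opens with general remarks
print(" ".join(truncate_by_importance(tokens, importance, 6)))
# the report the refund was denied
```

With the same six-token budget, the naive baseline keeps only the opening words, while the sketch spends most of the budget on the region around the high-scoring tokens. The abstract stresses that parameter optimization is key to performance, so values such as `head` and `window` here should be read as placeholders to tune, not recommendations.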
