Text document classification is an important task for diverse natural language processing applications. Traditional machine learning approaches focused mainly on reducing the dimensionality of textual data before classification. Although this improved overall classification accuracy, the classifiers still suffered from sparsity because better data representation techniques were lacking. Deep learning based text document classification, on the other hand, benefited greatly from the invention of word embeddings, which solved the sparsity problem, and researchers' focus has remained mainly on the development of deep architectures. Deeper architectures, however, learn some redundant features that limit the performance of deep learning based solutions. In this paper, we propose a two-stage text document classification methodology that combines traditional feature engineering with automatic feature engineering (using deep learning). The proposed methodology comprises a filter-based feature selection (FSE) algorithm followed by a deep convolutional neural network. The methodology is evaluated on the two most commonly used public datasets, i.e., 20 Newsgroups and BBC news. Evaluation results reveal that the proposed methodology outperforms the state of the art in both traditional machine learning and deep learning based text document classification by a significant margin of 7.7% on the 20 Newsgroups dataset and 6.6% on the BBC news dataset.
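The first stage of the two-stage idea described above can be illustrated with a minimal sketch. The abstract does not say which scoring function the filter-based selection uses, so the chi-square statistic below is an assumption chosen only for illustration, and the names `chi_square_scores`, `select_top_k`, and `filter_docs` are hypothetical. The second stage (an embedding layer feeding a deep convolutional neural network) is not shown; the sketch only produces the reduced token sequences that such a network would receive.

```python
from collections import Counter


def chi_square_scores(docs, labels):
    """Score each term by its largest chi-square statistic over all classes.

    docs is a list of token lists, labels the matching list of class labels.
    Chi-square is an assumed stand-in for the paper's filter criterion.
    """
    n = len(docs)
    class_counts = Counter(labels)
    doc_sets = [set(d) for d in docs]
    term_df = Counter(t for s in doc_sets for t in s)
    term_class_df = Counter((t, y) for s, y in zip(doc_sets, labels) for t in s)

    scores = {}
    for term, df in term_df.items():
        best = 0.0
        for cls, n_cls in class_counts.items():
            a = term_class_df[(term, cls)]  # in class, contains term
            b = df - a                      # outside class, contains term
            c = n_cls - a                   # in class, lacks term
            d = n - n_cls - b               # outside class, lacks term
            denom = (a + c) * (b + d) * (a + b) * (c + d)
            if denom:  # skip degenerate tables (e.g. term in every document)
                best = max(best, n * (a * d - b * c) ** 2 / denom)
        scores[term] = best
    return scores


def select_top_k(scores, k):
    """Return the k highest-scoring terms; ties are broken alphabetically."""
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


def filter_docs(docs, vocabulary):
    """Drop tokens outside the selected vocabulary, preserving token order."""
    keep = set(vocabulary)
    return [[t for t in d if t in keep] for d in docs]


# Toy corpus (invented for illustration, not taken from either dataset).
docs = [
    ["the", "match", "goal", "team"],
    ["the", "team", "won", "match"],
    ["the", "market", "stock", "fell"],
    ["the", "stock", "market", "rose"],
]
labels = ["sport", "sport", "business", "business"]

scores = chi_square_scores(docs, labels)
vocabulary = select_top_k(scores, k=4)
reduced_docs = filter_docs(docs, vocabulary)
print(vocabulary)
print(reduced_docs)
```

Token order is preserved in `filter_docs` because a convolutional network operates on sequences; only the vocabulary shrinks, which is one plausible way a filter stage could remove uninformative terms before the network learns its own features.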