Text classification is the most fundamental and essential task in natural language processing. The last decade has seen a surge of research in this area due to the unprecedented success of deep learning. Numerous methods, datasets, and evaluation metrics have been proposed in the literature, raising the need for a comprehensive and updated survey. This paper fills the gap by reviewing the state-of-the-art approaches from 1961 to 2021, ranging from traditional models to deep learning. We create a taxonomy for text classification according to the text involved and the models used for feature extraction and classification. We then discuss each of these categories in detail, covering both the technical developments and the benchmark datasets used to test predictions. This survey also provides a comprehensive comparison between different techniques and identifies the pros and cons of various evaluation metrics. Finally, we conclude by summarizing key implications, future research directions, and the challenges facing the research area.