Text classification is the most fundamental and essential task in natural language processing. The last decade has seen a surge of research in this area due to the unprecedented success of deep learning. Numerous methods, datasets, and evaluation metrics have been proposed in the literature, raising the need for a comprehensive and up-to-date survey. This paper fills the gap by reviewing state-of-the-art approaches from 1961 to 2020, focusing on models ranging from shallow to deep learning. We create a taxonomy for text classification according to the text involved and the models used for feature extraction and classification. We then discuss each of these categories in detail, covering both the technical developments and the benchmark datasets used to test predictions. We also provide a comprehensive comparison of the different techniques and identify the pros and cons of various evaluation metrics. Finally, we conclude by summarizing key implications, future research directions, and the challenges facing the research area.