In recent years, the number of complex documents and texts has grown exponentially, and classifying them accurately in many applications requires a deeper understanding of machine learning methods. Many machine learning approaches have achieved outstanding results in natural language processing. The success of these learning algorithms relies on their capacity to understand complex models and non-linear relationships within data. However, finding suitable structures, architectures, and techniques for text classification remains a challenge for researchers. This paper presents a brief overview of text classification algorithms. The overview covers text feature extraction, dimensionality reduction methods, existing algorithms and techniques, and evaluation methods. Finally, it discusses the limitations of each technique and their application to real-world problems.