The exponential growth of data generated on the Internet in the current information age is a driving force of the digital economy. Extracting information is the main source of value in accumulated big data. Machine learning algorithms that depend on statistical analysis and hand-engineered rules are overwhelmed by the vast complexities inherent in human languages. Natural Language Processing (NLP) equips machines to understand these diverse and complicated human languages. Text classification is an NLP task that automatically identifies patterns based on predefined or undefined label sets. Common text classification applications include information retrieval, news topic modeling, theme extraction, sentiment analysis, and spam detection. In text, some word sequences depend on the preceding or following word sequences to make full meaning; this dependency is a challenging task that requires the machine to store important earlier information so that it can influence later meaning. Sequence models such as RNNs, GRUs, and LSTMs are a breakthrough for tasks with long-range dependencies. Accordingly, we applied these models to binary and multi-class classification. The results were excellent, with most models performing in the range of 80% to 94%. However, these results are not exhaustive, as we believe there is room for improvement if machines are to compete with humans.
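The ability of an LSTM to store earlier information and let it affect later time steps comes from its gating mechanism. The following is a minimal sketch of a single LSTM cell forward step in NumPy (with hypothetical random weights, not the weights or architecture used in this work), illustrating how the cell state carries information across a sequence:

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step: gates decide what to forget, store, and expose."""
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b           # stacked pre-activations for all gates
    f = 1 / (1 + np.exp(-z[:n]))         # forget gate: keep or drop old memory
    i = 1 / (1 + np.exp(-z[n:2 * n]))    # input gate: admit new information
    o = 1 / (1 + np.exp(-z[2 * n:3 * n]))  # output gate: expose memory
    g = np.tanh(z[3 * n:])               # candidate cell update
    c = f * c_prev + i * g               # cell state carries long-range info
    h = o * np.tanh(c)                   # hidden state passed to the next step
    return h, c

# Toy dimensions and randomly initialized (hypothetical) weights.
rng = np.random.default_rng(0)
d_in, d_hid = 5, 4
W = rng.normal(0, 0.1, (4 * d_hid, d_in))
U = rng.normal(0, 0.1, (4 * d_hid, d_hid))
b = np.zeros(4 * d_hid)

# Run a short toy sequence; h and c persist between steps.
h = np.zeros(d_hid)
c = np.zeros(d_hid)
for t in range(3):
    x = rng.normal(size=d_in)
    h, c = lstm_step(x, h, c, W, U, b)
```

In a full classifier, the final hidden state `h` (or a pooling over all hidden states) would be fed to a softmax or sigmoid layer to produce the binary or multi-class prediction.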