波斯用户产生的文字内容的挑战:以机械学习为基础的方法 (The Challenges of Persian User-generated Textual Content: A Machine Learning-Based Approach)

Over recent years a lot of research papers and studies have been published on the development of effective approaches that benefit from a large amount of user-generated content and build intelligent predictive models on top of them. This research applies machine learning-based approaches to tackle the hurdles that come with Persian user-generated textual content. Unfortunately, there is still inadequate research in exploiting machine learning approaches to classify/cluster Persian text. Further, analyzing Persian text suffers from a lack of resources; specifically from datasets and text manipulation tools. Since the syntax and semantics of the Persian language is different from English and other languages, the available resources from these languages are not instantly usable for Persian. In addition, recognition of nouns and pronouns, parts of speech tagging, finding words' boundary, stemming or character manipulations for Persian language are still unsolved issues that require further studying. Therefore, efforts have been made in this research to address some of the challenges. This presented approach uses a machine-translated datasets to conduct sentiment analysis for the Persian language. Finally, the dataset has been rehearsed with different classifiers and feature engineering approaches. The results of the experiments have shown promising state-of-the-art performance in contrast to the previous efforts; the best classifier was Support Vector Machines which achieved a precision of 91.22%, recall of 91.71%, and F1 score of 91.46%.

翻译：近些年来,已经发表了大量研究论文和研究报告,内容涉及如何制定有效方法,从大量用户生成的内容中受益,并在上面建立智能预测模型。这项研究运用机器学习方法,解决波斯用户生成的文本内容带来的障碍。不幸的是,在利用机器学习方法对波斯文本进行分类/集群方面,研究仍然不够充分。此外,分析波斯文本缺乏资源,具体来自数据集和文本处理工具。由于波斯语的语法和语义不同于英语和其他语言,因此这些语言的现有资源不能立即用于波斯语。此外,对名词和代词的识别、语言标记、语言界限、波斯语言的断层或字符操控等部分的识别,仍然是有待进一步研究的未解问题。因此,在这项研究中已经努力应对一些挑战。提出的方法使用了机器翻译的数据集,1 用于对波斯语进行情绪分析。最后,数据集经过不同分类和地貌工程学的预选用资源排练。此外,对名词和标语的标语学方法部分、标语的标本部分、查找界限的界限部分、对波斯语言的精确度的精确度测试结果显示有希望的成绩。

相关内容

数据集

关注 88

数据集，又称为资料集、数据集合或资料集合，是一种由数据所组成的集合。
Data set（或dataset）是一个数据的集合，通常以表格形式出现。每一列代表一个特定变量。每一行都对应于某一成员的数据集的问题。它列出的价值观为每一个变量，如身高和体重的一个物体或价值的随机数。每个数值被称为数据资料。对应于行数，该数据集的数据可能包括一个或多个成员。

2020数据工程师成长路线图

专知会员服务

41+阅读 · 2020年9月6日

【深度学习社区检测】Deep Learning for Community Detection: Progress, Challenges and Opportunities

专知会员服务

28+阅读 · 2020年6月13日

零样本文本分类，Zero-Shot Learning for Text Classification

专知会员服务

97+阅读 · 2020年5月31日

【干货书】真实机器学习，264页pdf，Real-World Machine Learning

专知会员服务

115+阅读 · 2020年4月5日