Sentiment analysis is a text mining task that determines the polarity of a given text, i.e., its positiveness or negativeness. It has recently received considerable attention because of the interest in opinion mining on micro-blogging platforms. These new forms of textual expression pose new challenges for text analysis, given the use of slang and orthographic and grammatical errors, among others. Along with these challenges, a practical sentiment classifier should be able to handle large workloads efficiently. The aim of this research is to identify which text transformations (lemmatization, stemming, entity removal, among others), tokenizers (e.g., word $n$-grams), and token weighting schemes have the greatest impact on the accuracy of a classifier (Support Vector Machine) trained on two Spanish corpora. The methodology is to exhaustively evaluate all combinations of the text transformations and their respective parameters in order to find out which characteristics the best-performing classifiers have in common. Furthermore, among the different text transformations studied, we introduce a novel approach based on the combination of word-based $n$-grams and character-based $q$-grams. The results show that this novel combination of words and characters produces a classifier that outperforms the traditional word-based combination by $11.17\%$ and $5.62\%$ on the INEGI and TASS'15 datasets, respectively.
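To make the combined tokenizer concrete, the following minimal Python sketch builds a single bag of tokens from word $n$-grams and character $q$-grams. It is an illustration under stated assumptions, not the configuration evaluated in this work: the function names, the default values ($n \in \{1, 2\}$, $q = 3$), the lowercasing step, and the `w:`/`q:` prefixes are all hypothetical choices made for the example.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return the word n-grams of `text` as space-joined strings."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def char_qgrams(text, q):
    """Return the character q-grams of `text` after collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return [normalized[i:i + q] for i in range(len(normalized) - q + 1)]


def combined_tokens(text, word_ns=(1, 2), char_qs=(3,)):
    """Count word n-grams and character q-grams in one shared bag.

    The "w:" and "q:" prefixes are an illustrative assumption; they keep
    the two token families apart in the shared vocabulary.
    """
    tokens = []
    for n in word_ns:
        tokens.extend("w:" + gram for gram in word_ngrams(text, n))
    for q in char_qs:
        tokens.extend("q:" + gram for gram in char_qgrams(text, q))
    return Counter(tokens)


# Example: token counts for a short Spanish phrase.
bag = combined_tokens("muy buena pelicula")
```

In a full pipeline, counts such as `bag` would then be passed through a token weighting scheme and used as features for the Support Vector Machine. Character $q$-grams can still match when a word is misspelled or inflected, which is one plausible reason for combining them with word $n$-grams on noisy micro-blogging text.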