This work investigates sampling methods for sentiment analysis on two highly imbalanced datasets. One dataset contains online user reviews from the cooking platform Epicurious; the other contains comments addressed to the Planned Parenthood organization. In both datasets, the classes of interest are rare. Word n-grams extracted from the texts serve as features. A feature selection technique based on information gain is first applied to reduce the feature space to a manageable size. Several sampling methods are then applied to mitigate the class imbalance, and their results are analyzed.
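The pipeline described in the abstract (word n-gram features, information-gain feature selection, then resampling) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the authors' implementation: the toy corpus is invented, n-grams are limited to unigrams and bigrams, features are treated as binary presence indicators, and random oversampling stands in for the sampling methods, which the abstract does not name. The function names are hypothetical.

```python
import math
import random
from collections import Counter


def word_ngrams(text, n_max=2):
    """Return the set of word n-grams (n = 1..n_max) present in a text."""
    tokens = text.lower().split()
    return {
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    }


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(docs, labels, feature):
    """Entropy reduction from splitting on presence/absence of one n-gram."""
    present = [y for d, y in zip(docs, labels) if feature in d]
    absent = [y for d, y in zip(docs, labels) if feature not in d]
    n = len(labels)
    conditional = ((len(present) / n) * entropy(present)
                   + (len(absent) / n) * entropy(absent))
    return entropy(labels) - conditional


def select_features(docs, labels, k):
    """Keep the k n-grams with the highest information gain (ties: A-Z)."""
    vocab = set().union(*docs)
    ranked = sorted(vocab,
                    key=lambda f: (-information_gain(docs, labels, f), f))
    return ranked[:k]


def random_oversample(docs, labels, seed=0):
    """Duplicate randomly chosen minority examples until classes balance."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_docs, out_labels = list(docs), list(labels)
    for cls, count in counts.items():
        pool = [d for d, y in zip(docs, labels) if y == cls]
        for _ in range(target - count):
            out_docs.append(rng.choice(pool))
            out_labels.append(cls)
    return out_docs, out_labels


# Toy corpus, invented for illustration: "neg" is the rare class.
texts = [
    "great recipe easy to make",
    "loved it great flavor",
    "easy and tasty",
    "great dinner tonight",
    "too salty and bland",
]
labels = ["pos", "pos", "pos", "pos", "neg"]

docs = [word_ngrams(t) for t in texts]               # n-gram features
top = select_features(docs, labels, k=5)             # feature selection
reduced = [d & set(top) for d in docs]               # restrict to kept n-grams
balanced_docs, balanced_labels = random_oversample(reduced, labels)
```

In a real experiment, resampling would normally be applied to the training split only, so that evaluation data keeps the original class distribution.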