Emotion recognition is a machine learning application that can be performed on text, speech, or image data gathered from social media. Detecting emotion is useful in several fields, including opinion mining. With the spread of social media, platforms such as Twitter have become data sources, and the informal language used on them makes emotion detection difficult. EmoPars and ArmanEmo are two new human-labeled emotion datasets for the Persian language. These datasets, especially EmoPars, suffer from an imbalance in the number of samples across classes. In this paper, we evaluate EmoPars and compare it with ArmanEmo. Throughout this analysis, we use data augmentation techniques, data re-sampling, and class weights with Transformer-based Pretrained Language Models (PLMs) to handle the imbalance problem in these datasets. Moreover, feature selection is used to enhance the models' performance by emphasizing specific features of the text. In addition, we provide a new policy for selecting data from EmoPars that keeps only the high-confidence samples; as a result, the model does not see samples that express no specific emotion during training. Our model reaches a macro-averaged F1-score of 0.81 on ArmanEmo and 0.76 on EmoPars, which are new state-of-the-art results on these benchmarks.
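As a rough illustration of two ideas named in the abstract, class weights for imbalanced data and the macro-averaged F1-score, the following is a minimal sketch. It assumes inverse-frequency ("balanced") weighting, which the abstract does not specify, and the emotion labels and counts are made-up toy values, not taken from EmoPars or ArmanEmo.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    This is one common heuristic (an assumption here, not necessarily
    the weighting used in the paper): rare classes get larger weights.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in counts}


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so every class counts equally."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy label distribution: one frequent and two rarer emotions.
toy_labels = ["anger"] * 6 + ["joy"] * 3 + ["fear"] * 1
weights = inverse_frequency_weights(toy_labels)

# A classifier that always predicts the majority class looks acceptable on
# accuracy but poor on macro-F1, which is why macro-F1 suits imbalanced data.
toy_true = ["anger", "anger", "joy"]
toy_pred = ["anger", "anger", "anger"]
toy_score = macro_f1(toy_true, toy_pred)
```

Weights of this kind are typically passed to the loss function when fine-tuning a classifier, so that errors on rare emotions cost more than errors on frequent ones.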