Emotion analysis from textual input is considered both a challenging and an interesting task in Natural Language Processing. However, the lack of datasets for low-resource languages (e.g., Tamil) makes it difficult to conduct high-quality research in this area. We therefore introduce a labelled dataset for emotion recognition: the largest manually annotated dataset of its kind, with more than 42k Tamil YouTube comments labelled for 31 emotions, including neutral. The goal of this dataset is to improve emotion detection in multiple downstream tasks in Tamil. We also created three groupings of our emotions (3-class, 7-class and 31-class) and evaluated model performance on each grouping. Our MuRIL-base model achieved a macro-average F1-score of 0.60 on the 3-class grouping. On the 7-class and 31-class groupings, the Random Forest model performed well, with macro-average F1-scores of 0.42 and 0.29 respectively.
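The results above are reported as macro-average F1-scores, which weight every emotion class equally regardless of how many comments carry that label. As a minimal sketch of how that metric is computed (not the authors' evaluation code), the example below uses a hypothetical helper `macro_f1` and invented 3-class labels (`positive`, `negative`, `neutral`) that are assumptions for illustration only; the abstract does not name the classes in each grouping.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    per_class = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        per_class.append(2 * tp[label] / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Hypothetical labels, invented for illustration; not taken from the dataset.
y_true = ["positive", "positive", "negative", "neutral", "neutral", "neutral"]
y_pred = ["positive", "negative", "negative", "neutral", "neutral", "positive"]
score = macro_f1(y_true, y_pred)
```

Because each class contributes equally to the average, rare emotions pull the score down as much as frequent ones, which is one plausible reason scores drop as the number of classes grows from 3 to 31.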