Sentiment analysis is one of the most widely studied applications in NLP, but most work focuses on languages with large amounts of data. We introduce the first large-scale human-annotated Twitter sentiment dataset for the four most widely spoken languages in Nigeria (Hausa, Igbo, Nigerian-Pidgin, and Yoruba), consisting of around 30,000 annotated tweets per language (except for Nigerian-Pidgin) and including a significant fraction of code-mixed tweets. We propose text collection, filtering, processing, and labelling methods that enable us to create datasets for these low-resource languages. We evaluate a range of pre-trained models and transfer strategies on the dataset and find that language-specific models and language-adaptive fine-tuning generally perform best. We release the datasets, trained models, sentiment lexicons, and code to incentivize research on sentiment analysis in under-represented languages.