Sentiment analysis is one of the most widely studied applications in NLP, but most work focuses on languages with large amounts of data. We introduce the first large-scale human-annotated Twitter sentiment dataset for the four most widely spoken languages in Nigeria (Hausa, Igbo, Nigerian-Pidgin, and Yorùbá), consisting of around 30,000 annotated tweets per language (and 14,000 for Nigerian-Pidgin), including a significant fraction of code-mixed tweets. We propose text collection, filtering, processing, and labeling methods that enable us to create datasets for these low-resource languages. We evaluate a range of pre-trained models and transfer strategies on the dataset. We find that language-specific models and language-adaptive fine-tuning generally perform best. We release the datasets, trained models, sentiment lexicons, and code to incentivize research on sentiment analysis in under-represented languages.