Africa is home to over 2,000 languages from more than six language families and has the highest linguistic diversity of any continent. These include 75 languages with at least one million speakers each. Yet little NLP research has been conducted on African languages. Crucial to enabling such research is the availability of high-quality annotated datasets. In this paper, we introduce AfriSenti, a collection of 14 sentiment datasets comprising more than 110,000 tweets in 14 African languages (Amharic, Algerian Arabic, Hausa, Igbo, Kinyarwanda, Moroccan Arabic, Mozambican Portuguese, Nigerian Pidgin, Oromo, Swahili, Tigrinya, Twi, Xitsonga, and Yorùbá) from four language families, annotated by native speakers. The data is used in SemEval 2023 Task 12, the first Afro-centric SemEval shared task. We describe the data collection methodology, the annotation process, and the challenges encountered when curating each of the datasets. We conduct experiments with different sentiment classification baselines and discuss their usefulness. We hope AfriSenti enables new work on under-represented languages. The dataset is available at https://github.com/afrisenti-semeval/afrisent-semeval-2023 and can also be loaded as a Hugging Face dataset (https://huggingface.co/datasets/shmuhammad/AfriSenti).