MasakhaNEWS: News Topic Classification for African languages

David Ifeoluwa Adelani, Marek Masiak, Israel Abebe Azime, Jesujoba Alabi, Atnafu Lambebo Tonja, Christine Mwase, Odunayo Ogundepo, Bonaventure F. P. Dossou, Akintunde Oladipo, Doreen Nixdorf, Chris Chinenye Emezue, sana al-azzawi, Blessing Sibanda, Davis David, Lolwethu Ndolela, Jonathan Mukiibi, Tunde Ajayi, Tatiana Moteu, Brian Odhiambo, Abraham Owodunni, Nnaemeka Obiefuna, Muhidin Mohamed, Shamsuddeen Hassan Muhammad, Teshome Mulugeta Ababu, Saheed Abdullahi Salahudeen, Mesay Gemeda Yigezu, Tajuddeen Gwadabe, Idris Abdulmumin, Mahlet Taye, Oluwabusayo Awoyomi, Iyanuoluwa Shode, Tolulope Adelani, Habiba Abdulganiyu, Abdul-Hakeem Omotayo, Adetola Adeeko, Abeeb Afolabi, Anuoluwapo Aremu, Olanrewaju Samuel, Clemencia Siro, Wangari Kimotho, Onyekachi Ogbu, Chinedu Mbonu, Chiamaka Chukwuneke, Samuel Fanijo, Jessica Ojo, Oyinkansola Awosan, Tadesse Kebede, Toadoum Sari Sakayo, Pamela Nyatsine, Freedmore Sidume, Oreen Yousuf, Mardiyyah Oduwole, Tshinu Tshinu, Ussen Kimanuka, Thina Diko, Siyanda Nxakama, Sinodos Nigusse, Abdulmejid Johar, Shafie Mohamed, Fuad Mire Hassan, Moges Ahmed Mehamed, Evrard Ngabire, Jules Jules, Ivan Ssenkungu, Pontus Stenetorp

From arXiv. Accepted to IJCNLP-AACL 2023 (main conference).

African languages are severely under-represented in NLP research due to a lack of datasets covering several NLP tasks. While there are individual language-specific datasets that are being expanded to different tasks, only a handful of NLP tasks (e.g. named entity recognition and machine translation) have standardized benchmark datasets covering several geographically and typologically diverse African languages. In this paper, we develop MasakhaNEWS, a new benchmark dataset for news topic classification covering 16 languages widely spoken in Africa. We provide an evaluation of baseline models by training classical machine learning models and fine-tuning several language models. Furthermore, we explore several alternatives to full fine-tuning of language models that are better suited for zero-shot and few-shot learning, such as cross-lingual parameter-efficient fine-tuning (like MAD-X), pattern exploiting training (PET), prompting language models (like ChatGPT), and prompt-free sentence transformer fine-tuning (SetFit and the Cohere Embedding API). Our evaluation in the zero-shot setting shows the potential of prompting ChatGPT for news topic classification in low-resource African languages, achieving an average performance of 70 F1 points without leveraging additional supervision, unlike MAD-X. In the few-shot setting, we show that with as few as 10 examples per label, the PET approach achieves 86.0 F1 points, which is more than 90% of the performance of fully supervised training (92.6 F1 points).
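As a rough illustration of the few-shot setting described above, the sketch below draws 10 examples per label from a labeled pool and checks the "more than 90%" figure from the two reported F1 scores. The function names `sample_k_per_label` and `relative_performance`, the toy articles, and the three label names are assumptions made for illustration; they are not taken from the MasakhaNEWS data or code release, and the sampling procedure is not necessarily the one used in the paper.

```python
import random
from collections import Counter, defaultdict


def sample_k_per_label(examples, k=10, seed=0):
    """Pick up to k (text, label) pairs per label (hypothetical helper)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    subset = []
    for label in sorted(by_label):
        items = by_label[label]
        # Take every item if a label has fewer than k examples.
        subset.extend(rng.sample(items, min(k, len(items))))
    return subset


def relative_performance(few_shot_f1, full_f1):
    """Few-shot score as a percentage of the fully supervised score."""
    return 100.0 * few_shot_f1 / full_f1


# Toy pool with made-up articles and labels; not drawn from MasakhaNEWS.
toy_pool = [
    (f"article {i} about {label}", label)
    for label in ("sports", "health", "politics")
    for i in range(25)
]

few_shot = sample_k_per_label(toy_pool, k=10)
label_counts = Counter(label for _, label in few_shot)

# F1 scores reported in the abstract: PET with 10 examples per label
# versus full supervised training.
ratio = relative_performance(86.0, 92.6)

print(dict(label_counts))
print(f"{ratio:.1f}% of full supervised performance")
```

With the two scores from the abstract, the ratio comes out at roughly 92.9%, consistent with the "more than 90%" claim.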
