Text classification is one of the core problems in NLP, and it is widely used in language analysis. A lack of labeled training data makes this task harder for low-resource languages such as Amharic. Collecting, labeling, annotating, and releasing such data will encourage junior researchers, schools, and machine learning practitioners to apply existing classification models to their own language. In this short paper, we introduce an Amharic text classification dataset of more than 50k news articles categorized into 6 classes. The dataset is released with simple baseline results to encourage further study and experiments aimed at better performance.