Text classification is an important task in Natural Language Processing (NLP), where the goal is to categorise text data into predefined classes. In this study, we analyse the dataset creation steps and evaluation techniques for the multi-label news categorisation task as part of text classification. We first present a newly collected dataset for Uzbek text classification, gathered from 10 different news and press websites and covering 15 categories of news, press, and law texts. We also present a comprehensive evaluation of different models on this newly created dataset, ranging from traditional bag-of-words models to deep learning architectures. Our experiments show that the Recurrent Neural Network (RNN) and Convolutional Neural Network (CNN) based models outperform the rule-based models. The best performance is achieved by the BERTbek model, a transformer-based BERT model trained on an Uzbek corpus. Our findings provide a good baseline for further research in Uzbek text classification.