Potrika: Raw and Balanced Newspaper Datasets in the Bangla Language with Eight Topics and Five Attributes

Knowledge is central to human and scientific development. Natural Language Processing (NLP) allows automated analysis and creation of knowledge. Data is a crucial ingredient of NLP and machine learning. The scarcity of open datasets is a well-known problem in machine and deep learning research. This is very much the case for textual NLP datasets in English and other major world languages. For the Bangla language, the situation is even more challenging, and the number of large datasets for NLP research is practically nil. We hereby present Potrika, a large single-label Bangla news article textual dataset curated for NLP research from six popular online news portals in Bangladesh (Jugantor, Jaijaidin, Ittefaq, Kaler Kontho, Inqilab, and Somoyer Alo) for the period 2014-2020. The articles are classified into eight distinct categories (National, Sports, International, Entertainment, Economy, Education, Politics, and Science & Technology) and provide five attributes (News Article, Category, Headline, Publication Date, and Newspaper Source). The raw dataset comprises 185.51 million words and 12.57 million sentences in 664,880 news articles. Moreover, using NLP augmentation techniques, we create from the raw (unbalanced) dataset a second (balanced) dataset comprising 320,000 news articles, with 40,000 articles in each of the eight news categories. Potrika contains both datasets (raw and balanced) to suit a wide range of NLP research. To the best of our knowledge, Potrika is by far the largest and most extensive dataset for news classification.
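As a rough illustration of the five-attribute layout and the balance property described in the abstract, the following Python sketch models one record and checks per-category counts. The class name `Article`, its field names, and the helper functions are hypothetical; the abstract does not specify the published file format or column names. The arithmetic at the end uses only the figures quoted in the abstract.

```python
from collections import Counter
from dataclasses import dataclass

# The eight categories named in the abstract.
CATEGORIES = (
    "National", "Sports", "International", "Entertainment",
    "Economy", "Education", "Politics", "Science & Technology",
)


@dataclass(frozen=True)
class Article:
    """One record with the five attributes (field names are assumed)."""
    article: str           # News Article (full text)
    category: str          # Category (single label)
    headline: str          # Headline
    publication_date: str  # Publication Date
    source: str            # Newspaper Source


def category_counts(records):
    """Return the number of records per category, including empty ones."""
    counts = Counter(r.category for r in records)
    return {c: counts.get(c, 0) for c in CATEGORIES}


def is_balanced(records):
    """True when every category holds the same number of records."""
    return len(set(category_counts(records).values())) == 1


# Figures quoted in the abstract.
RAW_ARTICLES = 664_880
RAW_WORDS = 185.51e6
RAW_SENTENCES = 12.57e6
PER_CATEGORY = 40_000

# Derived averages for the raw dataset and the balanced total.
words_per_article = RAW_WORDS / RAW_ARTICLES          # roughly 279
sentences_per_article = RAW_SENTENCES / RAW_ARTICLES  # roughly 18.9
BALANCED_TOTAL = PER_CATEGORY * len(CATEGORIES)       # 320,000
```

The balanced total of 320,000 follows directly from 40,000 articles in each of the eight categories, and the raw figures imply an average of about 279 words per article.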

