People rely on news to learn what is happening around the world and to inform their daily lives. With fake news now widespread, a large-scale, high-quality source of authentic news articles labeled with their published categories is valuable for learning the natural-language syntax and semantics of authentic news. In this work, we present the News Category Dataset, which contains around 210k news headlines published between 2012 and 2022 and obtained from HuffPost, along with metadata that enables various NLP tasks. We also derive some novel insights from the dataset and describe its existing and potential applications.
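As a rough illustration of how category-labeled headlines of this kind might be inspected, the following minimal sketch counts records per category. It assumes the data is stored as JSON Lines with `category`, `headline`, and `date` fields; that layout, the field names, the helper `count_categories`, and the sample records are assumptions made for illustration and are not taken from the abstract.

```python
import io
import json
from collections import Counter


def count_categories(lines):
    """Count records per category in a JSON Lines stream.

    Assumes one JSON object per line with a "category" field
    (an assumed layout, not confirmed by the abstract).
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        counts[record.get("category", "UNKNOWN")] += 1
    return counts


# Hypothetical placeholder records, not real rows from the dataset.
sample = io.StringIO(
    '{"category": "POLITICS", "headline": "Placeholder headline A", "date": "2018-01-01"}\n'
    '{"category": "POLITICS", "headline": "Placeholder headline B", "date": "2019-06-15"}\n'
    '{"category": "TRAVEL", "headline": "Placeholder headline C", "date": "2020-03-02"}\n'
)
category_counts = count_categories(sample)
print(category_counts.most_common())
```

Because `count_categories` accepts any iterable of lines, the same function could be passed an open file handle in place of the in-memory sample.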