Social networks are widely used for information consumption and dissemination, especially during time-critical events such as natural disasters. Despite its large volume, social media content is often too noisy for direct use in any application. Therefore, it is important to filter, categorize, and concisely summarize the available content to facilitate effective consumption and decision-making. To address these issues, automatic classification systems have been developed using supervised modeling approaches, thanks to earlier efforts on creating labeled datasets. However, existing datasets are limited in several aspects (e.g., size, presence of duplicates) and less suitable for supporting more advanced and data-hungry deep learning models. In this paper, we present a new large-scale dataset with ~77K human-labeled tweets, sampled from a pool of ~24 million tweets across 19 disaster events that occurred between 2016 and 2019. Moreover, we propose a data collection and sampling pipeline, which is important for sampling social media data for human annotation. We report multiclass classification results using classic and deep learning (fastText and transformer) based models to set a baseline for future studies. The dataset and associated resources are publicly available at \url{https://crisisnlp.qcri.org/humaid_dataset.html}.
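As a minimal sketch of the kind of multiclass baseline reported above, the example below trains a fastText supervised classifier on labeled tweets. The file names, hyperparameter values, and label string are illustrative assumptions, not the paper's exact configuration; the only assumption about the data is that it has been exported to fastText's standard \texttt{\_\_label\_\_<class> <text>} line format.

\begin{verbatim}
import fasttext

# Each line of humaid_train.txt (hypothetical file name) holds one
# labeled tweet in fastText's supervised format, e.g.:
#   __label__rescue_volunteering_or_donation_effort <tweet text>
model = fasttext.train_supervised(
    input="humaid_train.txt",  # path is an assumption
    epoch=25,                  # illustrative hyperparameters,
    lr=0.5,                    # not the paper's settings
    wordNgrams=2,
)

# Evaluate on a held-out split; returns (N, precision@1, recall@1)
n, p1, r1 = model.test("humaid_test.txt")
print(f"samples={n}  P@1={p1:.3f}  R@1={r1:.3f}")

# Predict the humanitarian category of a new tweet
labels, probs = model.predict("Volunteers needed to distribute food")
\end{verbatim}

A transformer-based baseline would follow the same train/evaluate pattern, swapping the fastText calls for a fine-tuned sequence classification model.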