Social media often serves as a communication lifeline during natural disasters. Traditionally, tweets about a natural disaster are filtered from the Twitter stream by the disaster's name, and the filtered tweets are then sent for human annotation. Human annotation to create labeled sets for machine learning models is laborious, time-consuming, at times inaccurate, and, more importantly, does not scale in dataset size or to real-time use. In this work, we curate a silver-standard dataset using weak supervision. To validate its utility, we train machine learning models on the weakly supervised data to identify three types of natural disasters: earthquakes, hurricanes, and floods. Our results demonstrate that models trained on the silver-standard dataset achieve performance greater than 90% when classifying a manually curated, gold-standard dataset. To enable reproducible research and additional downstream utility, we release the silver-standard dataset to the scientific community.
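As a rough illustration of what weak supervision for disaster tweets can look like, the sketch below assigns silver labels with simple keyword rules and abstains when no rule or more than one rule fires. The abstract does not specify the heuristics actually used, so the keyword lists and the names `KEYWORDS`, `weak_label`, and `build_silver_set` are hypothetical and chosen only for this example.

```python
import re

# Hypothetical keyword lists, assumed for illustration only;
# the source does not state which heuristics were actually used.
KEYWORDS = {
    "earthquake": ("earthquake", "earthquakes", "aftershock"),
    "hurricane": ("hurricane", "hurricanes", "landfall"),
    "flood": ("flood", "floods", "flooding", "flash flood"),
}

# One case-insensitive, whole-word pattern per disaster type.
PATTERNS = {
    label: re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    for label, keywords in KEYWORDS.items()
}


def weak_label(text):
    """Return a disaster type if exactly one rule fires, else None (abstain)."""
    matched = [label for label, pattern in PATTERNS.items() if pattern.search(text)]
    return matched[0] if len(matched) == 1 else None


def build_silver_set(tweets):
    """Keep only the tweets that received an unambiguous weak label."""
    labeled = ((tweet, weak_label(tweet)) for tweet in tweets)
    return [(tweet, label) for tweet, label in labeled if label is not None]


if __name__ == "__main__":
    sample = [
        "Hurricane Maria made landfall overnight",
        "Strong aftershock felt downtown",
        "Flash flood warning issued for the county",
        "Earthquake triggered flooding near the river",  # ambiguous: dropped
        "Good morning everyone",  # no rule fires: dropped
    ]
    for tweet, label in build_silver_set(sample):
        print(label, "|", tweet)
```

Abstaining on ambiguous tweets trades coverage for label precision, a common choice when the silver labels are meant to train a downstream classifier that is later evaluated against a manually curated gold set.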