Social media such as Twitter provide valuable information to crisis managers and affected people during natural disasters. Machine learning can help structure and extract information from the large volume of messages shared during a crisis; however, the constantly evolving nature of crises makes effective domain adaptation essential. Supervised classification is limited by unchangeable class labels that may not be relevant to new events, and unsupervised topic modelling by insufficient prior knowledge. In this paper, we bridge the gap between the two and show that BERT embeddings fine-tuned on crisis-related tweet classification can effectively be used to adapt to a new crisis, discovering novel topics while preserving relevant classes from supervised training, and leveraging bidirectional self-attention to extract topic keywords. We create a dataset of tweets from a snowstorm to evaluate our method's transferability to new crises, and find that it outperforms traditional topic models in both automatic and human evaluations grounded in the needs of crisis managers. More broadly, our method can be used for textual domain adaptation where the latent classes are unknown but overlap with known classes from other domains.
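The approach summarized above can be sketched roughly as: embed tweets with a fine-tuned BERT encoder, cluster the embeddings to discover topics in a new crisis, and rank topic keywords by the attention each token receives from the [CLS] position. The following is a minimal illustration of that idea under stated assumptions, not the paper's implementation; the checkpoint name `bert-base-uncased` (standing in for a crisis-finetuned model), the KMeans clustering step, and the helper names `embed_and_attend` and `discover_topics` are assumptions made for the sketch.

```python
# Hypothetical sketch: adapting a BERT encoder fine-tuned on crisis-tweet
# classification to discover topics in tweets from a new crisis.
# The model name, clustering method, and keyword-scoring heuristic below are
# illustrative assumptions, not the paper's exact pipeline.
import torch
from collections import Counter
from sklearn.cluster import KMeans
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-uncased"  # stand-in for a crisis-finetuned checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_attentions=True)
model.eval()


def embed_and_attend(tweets):
    """Return [CLS] embeddings and CLS-to-token attention for each tweet."""
    enc = tokenizer(tweets, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
    cls_embeddings = out.last_hidden_state[:, 0, :]             # (batch, hidden)
    # Average the last layer's attention from the [CLS] query over all heads.
    cls_attention = out.attentions[-1][:, :, 0, :].mean(dim=1)  # (batch, seq)
    return cls_embeddings, cls_attention, enc


def discover_topics(tweets, n_topics=5, keywords_per_topic=10):
    """Cluster tweet embeddings into topics; rank keywords by attention mass."""
    embeddings, attention, enc = embed_and_attend(tweets)
    labels = KMeans(n_clusters=n_topics, n_init=10).fit_predict(embeddings.numpy())
    topics = {}
    for topic in range(n_topics):
        scores = Counter()
        for i in (j for j, label in enumerate(labels) if label == topic):
            tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][i].tolist())
            for tok, weight in zip(tokens, attention[i].tolist()):
                # Skip special tokens and WordPiece continuations.
                if tok not in tokenizer.all_special_tokens and not tok.startswith("##"):
                    scores[tok] += weight
        topics[topic] = [t for t, _ in scores.most_common(keywords_per_topic)]
    return topics


if __name__ == "__main__":
    sample = [
        "roads closed after heavy snowfall downtown",
        "power outage reported across the east side",
        "shelters open for anyone stranded by the storm",
    ]
    print(discover_topics(sample, n_topics=2, keywords_per_topic=5))
```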