Social media data has emerged as a useful source of timely information about real-world crisis events. One of the main tasks related to the use of social media for disaster management is the automatic identification of crisis-related messages. Most studies on this topic have focused on analyzing data for a particular type of event in a specific language. This limits how well existing approaches generalize, because models cannot be directly applied to new types of events or to other languages. In this work, we study the task of automatically classifying messages that are related to crisis events by leveraging cross-language and cross-domain labeled data. Our goal is to use labeled data from high-resource languages to classify messages from other (low-resource) languages and/or from new (previously unseen) types of crisis situations. For our study, we consolidated a large unified dataset from the literature, covering multiple crisis events and languages. Our empirical findings show that it is indeed possible to leverage data from crisis events in English to classify the same type of event in other languages, such as Spanish and Italian (80.0% F1-score). Furthermore, we achieve good performance for the cross-domain task (80.0% F1-score) in a cross-lingual setting. Overall, our work helps alleviate the data scarcity problem that hampers multilingual crisis classification, in particular by mitigating cold-start situations in emergency events, when time is of the essence.