Timely and effective response to humanitarian crises requires quick and accurate analysis of large amounts of text data, a process that can benefit greatly from expert-assisted NLP systems trained on validated and annotated data in the humanitarian response domain. To enable the creation of such NLP systems, we introduce and release HumSet, a novel and rich multilingual dataset of humanitarian response documents annotated by experts in the humanitarian response community. The dataset provides documents in three languages (English, French, Spanish) and covers a variety of humanitarian crises from 2018 to 2021 across the globe. For each document, HumSet provides selected snippets (entries) as well as classes assigned to each entry, annotated using common humanitarian information analysis frameworks. HumSet also provides novel and challenging entry extraction and multi-label entry classification tasks. In this paper, we take a first step towards approaching these tasks and conduct a set of experiments on pre-trained language models (PLMs) to establish strong baselines for future research in this domain. The dataset is available at https://blog.thedeep.io/humset/.
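To make the multi-label entry classification task concrete, here is a minimal sketch on toy data using a simple TF-IDF one-vs-rest baseline rather than a PLM. The example entries and the label names ("Health", "Food Security") are hypothetical illustrations, not taken from HumSet or its annotation frameworks; the key point is that each entry may receive several labels at once.

```python
# Toy sketch of multi-label entry classification (hypothetical entries and labels).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

entries = [
    "Cholera cases are rising in displacement camps.",
    "Food prices doubled after the flooding.",
    "Clinics report shortages of both staff and food aid.",
]
# Each entry carries one or more labels.
labels = [["Health"], ["Food Security"], ["Health", "Food Security"]]

# Binarize the label sets into an entries-by-labels indicator matrix.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

# Vectorize the entry text and fit one binary classifier per label.
vec = TfidfVectorizer()
X = vec.fit_transform(entries)
clf = OneVsRestClassifier(LogisticRegression()).fit(X, Y)

# Predictions are binary label vectors, so an entry can get multiple labels.
pred = clf.predict(vec.transform(["New cholera outbreak reported."]))
```

A PLM baseline would replace the TF-IDF features with a transformer encoder and a sigmoid output layer per label, but the multi-label framing shown here stays the same.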