In this paper, we present a manually annotated corpus of 10,000 tweets containing public reports of five types of COVID-19 events: positive tests, negative tests, deaths, denied access to testing, and claimed cures and preventions. We designed slot-filling questions for each event type and annotated a total of 31 fine-grained slots, such as the location of an event, recent travel, and close contacts. We show that our corpus can support fine-tuning BERT-based classifiers to automatically extract publicly reported events and help track the spread of a new disease. We also demonstrate that, by aggregating events extracted from millions of tweets, we achieve surprisingly high precision when answering complex queries, such as "Which organizations have employees that tested positive in Philadelphia?" We will release our corpus (with user information removed), automatic extraction models, and the corresponding knowledge base to the research community.
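To make the aggregation step concrete, the following is a minimal sketch of how slot-filled events might be pooled to answer a query like the Philadelphia example. It is not the extraction or aggregation method used in this work. The `ExtractedEvent` record, the slot names `employer` and `where`, the event label `tested_positive`, and the `min_tweets` threshold are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ExtractedEvent:
    """One event extracted from one tweet (hypothetical record layout)."""
    tweet_id: str
    event_type: str                      # e.g. "tested_positive" (assumed label)
    slots: dict = field(default_factory=dict)  # slot name -> extracted text


def organizations_with_event(events, event_type, location, min_tweets=2):
    """Return (organization, tweet_count) pairs for a given event and place.

    An organization is kept only if at least `min_tweets` distinct tweets
    support it, which trades recall for precision.
    """
    supporting = defaultdict(set)  # organization -> set of supporting tweet ids
    for ev in events:
        if ev.event_type != event_type:
            continue
        # "where" and "employer" are assumed slot names, not the corpus schema.
        if ev.slots.get("where", "").strip().lower() != location.lower():
            continue
        employer = ev.slots.get("employer", "").strip().lower()
        if employer:
            supporting[employer].add(ev.tweet_id)

    # Rank by number of supporting tweets, breaking ties alphabetically.
    ranked = sorted(supporting.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(name, len(ids)) for name, ids in ranked if len(ids) >= min_tweets]
```

Requiring several independent tweets before reporting an organization is one simple way that aggregation can raise precision over single-tweet extraction: an isolated extraction error is unlikely to be repeated across multiple tweets.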