A New Data Integration Framework for Covid-19 Social Media Information

The Covid-19 pandemic poses a serious threat to people's health, with over 250 million confirmed cases and over 5 million deaths globally. To reduce the burden on national health care systems and to mitigate the effects of the outbreak, accurate modelling and forecasting methods for short- and long-term health demand are needed to inform government interventions aimed at curbing the pandemic. Current research on Covid-19 typically relies on a single source of information, most often structured historical pandemic data. Other studies focus exclusively on unstructured information retrieved online, such as social media data. However, the combined use of structured and unstructured information remains unexplored. This paper aims to fill this gap by combining historical and social media information through a novel data integration methodology. The proposed approach is based on vine copulas, which allow us to exploit the dependencies between different sources of information. We apply the methodology to combine structured datasets retrieved from official sources with a large unstructured dataset collected from social media. The results show that the combined use of official and online-generated information yields a more accurate assessment of the evolution of the Covid-19 pandemic than the use of official data alone.
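To make the vine copula idea concrete, the following is a minimal sketch of a three-variable D-vine built from pair copulas, not the paper's actual model. The abstract does not state the vine structure, the copula families, or the variables used, so everything here is an assumption: Gaussian pair copulas, the ordering 1-2-3, the helper names (`pseudo_obs`, `fit_gaussian_rho`, `h_gaussian`, `fit_dvine3`, `demo`), and the synthetic data standing in for official and social media series.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for normal scores


def clip(p, eps=1e-10):
    """Keep a probability strictly inside (0, 1) so inv_cdf is defined."""
    return min(max(p, eps), 1.0 - eps)


def pseudo_obs(values):
    """Rank-transform a sample to pseudo-observations in (0, 1)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    u = [0.0] * n
    for rank, i in enumerate(order, start=1):
        u[i] = rank / (n + 1)
    return u


def fit_gaussian_rho(u, v):
    """Estimate a Gaussian pair-copula parameter as the correlation
    of the normal scores of u and v (assumed family, not from the paper)."""
    a = [STD_NORMAL.inv_cdf(clip(x)) for x in u]
    b = [STD_NORMAL.inv_cdf(clip(y)) for y in v]
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def h_gaussian(u, v, rho):
    """Conditional CDF F(u | v) implied by a Gaussian pair copula."""
    x = STD_NORMAL.inv_cdf(clip(u))
    y = STD_NORMAL.inv_cdf(clip(v))
    return STD_NORMAL.cdf((x - rho * y) / math.sqrt(1.0 - rho * rho))


def fit_dvine3(x1, x2, x3):
    """Fit a D-vine with order 1-2-3.

    Tree 1 models the pairs (1,2) and (2,3); tree 2 models the pair
    (1,3) conditional on variable 2.
    """
    u1, u2, u3 = pseudo_obs(x1), pseudo_obs(x2), pseudo_obs(x3)
    rho12 = fit_gaussian_rho(u1, u2)
    rho23 = fit_gaussian_rho(u2, u3)
    # Transform to conditional pseudo-observations given variable 2.
    u1_given_2 = [h_gaussian(a, b, rho12) for a, b in zip(u1, u2)]
    u3_given_2 = [h_gaussian(c, b, rho23) for c, b in zip(u3, u2)]
    rho13_2 = fit_gaussian_rho(u1_given_2, u3_given_2)
    return {"rho12": rho12, "rho23": rho23, "rho13|2": rho13_2}


def demo(n=500, seed=0):
    """Synthetic illustration only: x1 and x3 both depend on x2."""
    rng = random.Random(seed)
    x2 = [rng.gauss(0, 1) for _ in range(n)]
    x1 = [0.8 * z + 0.6 * rng.gauss(0, 1) for z in x2]
    x3 = [0.6 * z + 0.8 * rng.gauss(0, 1) for z in x2]
    return fit_dvine3(x1, x2, x3)


if __name__ == "__main__":
    print(demo())
```

In this synthetic setup the first and third series are related only through the second, so the conditional parameter in the second tree should come out close to zero while the two first-tree parameters stay clearly positive. A real application would select the vine structure and pair-copula families from the data, typically with a dedicated vine copula library.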
