The Covid-19 pandemic presents a serious threat to people's health, resulting in over 250 million confirmed cases and over 5 million deaths globally. To reduce the burden on national health care systems and to mitigate the effects of the outbreak, accurate modelling and forecasting methods for short- and long-term health demand are needed to inform government interventions aimed at curbing the pandemic. Current research on Covid-19 typically relies on a single source of information, most often structured historical pandemic data. Other studies focus exclusively on unstructured insights retrieved online, such as data from social media. However, the combined use of structured and unstructured information remains unexplored. This paper aims to fill this gap by leveraging both historical and social media information through a novel data integration methodology. The proposed approach is based on vine copulas, which allow us to improve predictions by exploiting the dependencies between different sources of information. We apply the methodology to combine structured datasets retrieved from official sources with a large unstructured dataset of information collected from social media. The results show that the proposed approach yields more accurate estimates and predictions of the evolution of the Covid-19 pandemic than traditional approaches.