The emergence of the novel coronavirus (COVID-19) has generated a need to quickly and accurately assemble up-to-date information related to its spread. While death counts provide a reliable information feed, the latency of data derived from deaths is significant. Confirmed cases derived from positive test results potentially provide a lower-latency data feed. However, the sampling of those tested varies over time, and the reason for testing is often not recorded. Hospital admissions typically occur around 1–2 weeks after infection and so are already out of date relative to the time of initial infection. The extent to which these issues are problematic is likely to vary over time and between countries. We use a machine learning algorithm for natural language processing, trained in multiple languages, to identify symptomatic individuals from social media, and in particular Twitter, in real time. We then use an extended SEIRD epidemiological model to fuse combinations of low-latency feeds, including the symptomatic counts from Twitter, with death data to estimate the model's parameters and nowcast the number of people in each compartment. The model is implemented in the probabilistic programming language Stan and uses a bespoke numerical integrator. We present results showing that using specific low-latency data feeds alongside death data provides more consistent and accurate forecasts of COVID-19-related deaths than using death data alone.
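The dynamics underlying the extended SEIRD model described above can be illustrated with the basic SEIRD compartments (Susceptible, Exposed, Infectious, Recovered, Dead). The sketch below is a minimal forward-Euler simulation in Python; the parameter values and the `seird_step`/`simulate` helpers are illustrative assumptions for exposition only, not the paper's implementation, which fits an extended model in Stan with a bespoke numerical integrator:

```python
# Minimal SEIRD sketch (hypothetical parameters, illustrative only).
# Compartments: S (susceptible), E (exposed), I (infectious),
#               R (recovered), D (dead).

def seird_step(state, beta, sigma, gamma, mu, n, dt):
    """Advance the basic SEIRD ODEs by one forward-Euler step of size dt."""
    s, e, i, r, d = state
    new_exposed = beta * s * i / n    # S -> E: infections per unit time
    new_infectious = sigma * e        # E -> I: end of incubation
    new_removed = gamma * i           # I -> R or D: leaving the infectious pool
    s += -new_exposed * dt
    e += (new_exposed - new_infectious) * dt
    i += (new_infectious - new_removed) * dt
    r += (1.0 - mu) * new_removed * dt   # recoveries
    d += mu * new_removed * dt           # deaths: the high-latency observable
    return (s, e, i, r, d)

def simulate(days, n=1e6, beta=0.4, sigma=1 / 5.0, gamma=1 / 7.0,
             mu=0.01, dt=0.1):
    """Run the SEIRD model from a seed of 10 infectious individuals."""
    state = (n - 10.0, 0.0, 10.0, 0.0, 0.0)
    for _ in range(int(days / dt)):
        state = seird_step(state, beta, sigma, gamma, mu, n, dt)
    return state
```

In the paper's setting, such deterministic dynamics sit inside a Bayesian model: observed data streams (deaths, and low-latency feeds such as Twitter symptomatic counts) are linked to the latent compartment trajectories, and Stan's inference machinery estimates the transmission parameters.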