The COVID-19 pandemic has impacted lives and economies across the globe, leading to many deaths. While vaccination is an important intervention, its roll-out is slow and unequal across countries. Therefore, extensive testing remains one of the key methods to monitor and contain the virus. However, testing on a large scale is expensive and arduous, so alternative methods are needed to estimate the number of cases. Online surveys have been shown to be an effective method for data collection amidst the pandemic. In this work, we develop machine learning models to estimate the prevalence of COVID-19 using self-reported symptoms. Our best model predicts daily cases with a mean absolute error (MAE) of 226.30 (normalized MAE of 27.09%) per state, which demonstrates the possibility of estimating the number of confirmed cases from self-reported symptoms. The models are developed at two levels of data granularity: local models, each trained at the state level, and a single global model trained on the combined data aggregated across all states. Our results indicate a lower error for the local models than for the global model. We also show that the most important symptoms (features) vary considerably from state to state. This work demonstrates that models developed on crowd-sourced data, curated via online platforms, can complement the existing epidemiological surveillance infrastructure in a cost-effective manner.
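To make the reported error metrics concrete, the following minimal sketch computes MAE and a normalized MAE for one state. The abstract does not say how the MAE is normalized, so dividing by the mean of the observed daily cases is an assumption here. The function names and the case counts are hypothetical and for illustration only.

```python
from statistics import mean


def mean_absolute_error(actual, predicted):
    """Average absolute difference between observed and predicted daily cases."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def normalized_mae(actual, predicted):
    """MAE divided by the mean observed count.

    Assumption: the normalization used in the paper is not given in the
    abstract; mean-normalization is one common choice.
    """
    return mean_absolute_error(actual, predicted) / mean(actual)


# Hypothetical daily confirmed cases for one state (not data from the paper).
actual = [100, 200, 300]
predicted = [110, 180, 330]

mae = mean_absolute_error(actual, predicted)
nmae = normalized_mae(actual, predicted)
print(f"MAE = {mae:.2f}, normalized MAE = {nmae:.2%}")
```

Under this assumption, a normalized MAE expresses the error as a fraction of the typical daily case count, which makes errors comparable between states with very different case loads.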