Over the past few months, the outbreak of coronavirus disease (COVID-19) has spread across the world. A reliable and accurate dataset of cases is vital for scientists conducting related research and for policy-makers making decisions. We collect daily reported COVID-19 data for the United States from four open sources: the New York Times, the COVID-19 Data Repository by Johns Hopkins University, the COVID Tracking Project at The Atlantic, and USAFacts, and compare the similarities and differences among them. To obtain reliable data for further analysis, we first examine the cyclical pattern and the following anomalies, which occur frequently in the reported cases: (1) order-dependency violations, (2) point or period anomalies, and (3) reporting delays. To address the detected issues, we propose corresponding repair methods and procedures for cases where corrections are necessary. In addition, we integrate the reported COVID-19 cases with county-level auxiliary information on local features from official sources, such as health infrastructure, demographic, socioeconomic, and environmental information, which is also essential for understanding the spread of the virus.
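To make the first anomaly type concrete, the following is a minimal sketch of how an order-dependency violation might be detected in a cumulative case series, under the assumption that cumulative counts should never decrease from one day to the next. The function names `find_order_violations` and `repair_backward`, the sample counts, and the backward running-minimum repair are all illustrative assumptions and are not taken from the paper, whose actual repair procedures are not described in this abstract.

```python
from typing import List


def find_order_violations(cumulative: List[int]) -> List[int]:
    """Return every index t where the cumulative count drops below day t-1."""
    return [
        t for t in range(1, len(cumulative))
        if cumulative[t] < cumulative[t - 1]
    ]


def repair_backward(cumulative: List[int]) -> List[int]:
    """Illustrative repair (an assumption, not the paper's method):
    treat later reports as more trustworthy and cap each earlier value
    at the smallest value that follows it (backward running minimum)."""
    repaired = list(cumulative)
    for t in range(len(repaired) - 2, -1, -1):
        repaired[t] = min(repaired[t], repaired[t + 1])
    return repaired


# Hypothetical cumulative counts for one county (made-up numbers).
counts = [10, 12, 15, 14, 18, 18, 17, 20]

violations = find_order_violations(counts)
repaired = repair_backward(counts)

print("violating indices:", violations)
print("repaired series:  ", repaired)
```

The backward pass is only one possible choice: it assumes that a later, lower count reflects a correction of an earlier over-report. A forward running maximum would encode the opposite assumption, and neither should be applied without checking the source's own revision notes.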