Objectives: Federal open data initiatives that promote increased sharing of federally collected data are important for transparency, data quality, trust, and relationships with the public and with state, tribal, local, and territorial (STLT) partners. These initiatives advance understanding of health conditions and diseases by making data available to more researchers, scientists, and policymakers for analysis, collaboration, and use beyond CDC responders. This is particularly true for emerging conditions such as COVID-19, where there is much to learn and data needs are still evolving. Since the beginning of the outbreak, CDC has collected person-level, de-identified data from jurisdictions and currently holds more than 8 million records, a number that increases each day. This paper describes how CDC designed and produces two de-identified public datasets from these collected data. Materials and Methods: Data elements were included based on usefulness, public requests, and privacy implications; specific field values were suppressed to reduce the risk of reidentification and exposure of confidential information. Datasets were created and verified for privacy and confidentiality using the analytic tools of a data management platform as well as R scripts. Results: Unrestricted data are available to the public through Data.CDC.gov, and restricted data, which include additional fields, are available under a data use agreement through a private repository on GitHub.com. Practice Implications: A better understanding of the available public data, the methods used to create them, and the algorithms used to protect the privacy of de-identified individuals allows for improved data use. Automating data generation procedures allows greater and more timely sharing of data.
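The suppression and verification steps named under Materials and Methods can be illustrated with a minimal sketch. The abstract does not state the actual rule or threshold, so the code below assumes a generic small-cell rule: if a combination of quasi-identifying fields is shared by fewer than k records, those field values are replaced with a placeholder. The field names (`age_group`, `sex`, `hosp`), the threshold `k`, the placeholder string, and both function names are hypothetical, and the sketch is written in Python rather than the R scripts the authors mention.

```python
from collections import Counter


def _combo(record, fields):
    """Return the tuple of values a record holds for the given fields."""
    return tuple(record.get(f) for f in fields)


def suppress_small_cells(records, fields, k=5, placeholder="Suppressed"):
    """Return copies of the records in which the values of `fields` are
    masked whenever that combination of values occurs in fewer than k
    records. This is an assumed rule, not the one documented by CDC."""
    counts = Counter(_combo(r, fields) for r in records)
    masked = []
    for r in records:
        row = dict(r)  # copy so the input records are left untouched
        if counts[_combo(r, fields)] < k:
            for f in fields:
                row[f] = placeholder
        masked.append(row)
    return masked


def min_group_size(records, fields, placeholder="Suppressed"):
    """Verification step: smallest group size among records that still
    carry at least one unsuppressed value in `fields`. Returns None if
    every record was suppressed."""
    counts = Counter(
        _combo(r, fields)
        for r in records
        if any(r.get(f) != placeholder for f in fields)
    )
    return min(counts.values()) if counts else None


# Hypothetical records; not taken from any CDC dataset.
sample = [
    {"age_group": "0-9", "sex": "F", "hosp": "No"},
    {"age_group": "0-9", "sex": "F", "hosp": "Yes"},
    {"age_group": "0-9", "sex": "F", "hosp": "No"},
    {"age_group": "80+", "sex": "M", "hosp": "Yes"},
]
fields = ["age_group", "sex"]
masked = suppress_small_cells(sample, fields, k=2)
smallest = min_group_size(masked, fields)
```

In this sketch the single record with the rare combination has both quasi-identifiers masked while its other field is kept, and the verification function confirms that every remaining visible combination meets the threshold. A production pipeline would likely also need to handle partial suppression and complementary disclosure, which this example does not attempt.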