Epidemiologists, scientists, statisticians, historians, data engineers, and data scientists are working to find descriptive models and theories that explain the spread of COVID-19, or to build predictive analytics models that estimate the peak of the curves of COVID-19 confirmed cases, recovered cases, and deaths. In the CRISP-DM life cycle, 75% of the time is consumed by the data preparation phase alone, which puts considerable pressure and stress on the scientists and data scientists building machine learning models. This paper aims to reduce data preparation effort by presenting detailed schema designs and technical data preparation scripts for formatting and storing Johns Hopkins University COVID-19 daily data in the HBase NoSQL data store, and for enabling HiveQL querying of COVID-19 data in a relational, SQL-like Hive style.