Data volumes are growing at an unprecedented rate, and with that growth comes responsibility for the quality of the data. Data quality refers to the relevance of the information a dataset contains, and it supports organizational operations such as decision making and planning. Data quality is mostly measured on an ad hoc basis, and hence the concepts developed so far offer no practical application. This empirical study was undertaken to build a concrete, automated data quality platform that assesses the quality of an incoming dataset and generates a quality label, a score, and a comprehensive report. We use datasets from healthdata.gov, opendata.nhs, and the Demographic and Health Surveys (DHS) Program to observe variations in the quality score and to formulate a label using Principal Component Analysis (PCA). The study yields a metric that encompasses nine quality ingredients: provenance, dataset characteristics, uniformity, metadata coupling, percentage of missing cells, percentage of duplicate rows, skewness of the data, the ratio of inconsistencies in categorical columns, and the correlation between these attributes. The study also provides an illustrative case study and validates the metric following mutation testing approaches. The resulting platform takes an incoming dataset and its metadata and returns the data quality (DQ) score, report, and label. These results should be useful to data scientists, since the quality label can instill confidence in a dataset before they deploy it in their own applications.
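Three of the nine ingredients named above (percentage of missing cells, percentage of duplicate rows, and skewness) are simple enough to illustrate directly. The following is a minimal sketch only: the abstract does not give the formulas the platform uses, so the definitions here (a missing cell is `None` or an empty string, a duplicate is an exact repeat of an earlier row, skewness is the population moment coefficient) and all function names are assumptions made for illustration.

```python
from statistics import fmean


def missing_cell_pct(rows):
    """Percentage of cells that are None or an empty string (assumed definition)."""
    cells = [value for row in rows for value in row]
    if not cells:
        return 0.0
    missing = sum(1 for value in cells if value is None or value == "")
    return 100.0 * missing / len(cells)


def duplicate_row_pct(rows):
    """Percentage of rows that exactly repeat an earlier row (assumed definition)."""
    if not rows:
        return 0.0
    seen = set()
    duplicates = 0
    for row in rows:
        key = tuple(row)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return 100.0 * duplicates / len(rows)


def skewness(values):
    """Population skewness g1 = m3 / m2**1.5, skipping None entries."""
    xs = [float(v) for v in values if v is not None]
    if len(xs) < 2:
        return 0.0
    mean = fmean(xs)
    m2 = fmean([(x - mean) ** 2 for x in xs])  # second central moment
    m3 = fmean([(x - mean) ** 3 for x in xs])  # third central moment
    return 0.0 if m2 == 0 else m3 / m2 ** 1.5


# Hypothetical toy dataset: each tuple is one row.
sample_rows = [
    ("a", 1, 2.0),
    ("b", None, 3.0),
    ("a", 1, 2.0),   # exact repeat of the first row
    ("c", 4, ""),    # empty string counted as missing
]

if __name__ == "__main__":
    print("missing cells (%):", missing_cell_pct(sample_rows))
    print("duplicate rows (%):", duplicate_row_pct(sample_rows))
    print("skewness of column 1:", skewness([row[1] for row in sample_rows]))
```

How such per-ingredient values are weighted and combined into a single score and label (the abstract mentions PCA for the label) is not specified here, so the sketch stops at the individual measurements.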