Data Quality Measures and Efficient Evaluation Algorithms for Large-Scale High-Dimensional Data

Machine learning has proven effective in various application areas, such as object and speech recognition on mobile systems. Because the availability of large training datasets is key to the success of machine learning, many datasets are being released and published online. From the point of view of a data consumer or manager, measuring data quality is an important first step in the learning process: we need to determine which datasets to use, update, and maintain. However, few practical ways to measure data quality are available today, especially for large-scale high-dimensional data such as images and videos. This paper proposes two data quality measures that compute class separability and in-class variability, two important aspects of data quality, for a given dataset. Classical data quality measures tend to focus only on class separability; we suggest that in-class variability is another important data quality factor. We provide efficient algorithms to compute our quality measures, based on random projections and bootstrapping, with statistical benefits on large-scale high-dimensional data. In experiments, we show that our measures are compatible with classical measures on small-scale data and can be computed much more efficiently on large-scale high-dimensional datasets.
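The abstract names random projections and bootstrapping but gives no formulas, so the following is only a minimal sketch under stated assumptions, not the paper's algorithm. It assumes class separability can be approximated by a Fisher-style ratio of between-class to within-class scatter along the best of several random one-dimensional projections, and in-class variability by the average within-class variance along those projections, both averaged over bootstrap resamples. The function name `projected_quality` and all of its parameters are hypothetical.

```python
import numpy as np


def projected_quality(X, y, n_projections=50, n_bootstrap=20,
                      sample_size=200, seed=0):
    """Estimate (separability, variability) for a labeled dataset.

    Both definitions are illustrative assumptions, not the paper's measures.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n, d = X.shape
    m = min(sample_size, n)
    sep_scores, var_scores = [], []

    for _ in range(n_bootstrap):
        # Bootstrap: draw m rows with replacement.
        idx = rng.integers(0, n, size=m)
        Xb, yb = X[idx], y[idx]
        labels = np.unique(yb)
        if labels.size < 2:
            continue  # separability is undefined with a single class

        # Random unit directions; each column of W is one projection.
        W = rng.standard_normal((d, n_projections))
        W /= np.linalg.norm(W, axis=0, keepdims=True)
        Z = Xb @ W  # shape (m, n_projections)

        overall_mean = Z.mean(axis=0)
        between = np.zeros(n_projections)
        within = np.zeros(n_projections)
        for c in labels:
            Zc = Z[yb == c]
            class_mean = Zc.mean(axis=0)
            between += Zc.shape[0] * (class_mean - overall_mean) ** 2
            within += ((Zc - class_mean) ** 2).sum(axis=0)

        # Separability: best between/within scatter ratio over projections.
        sep_scores.append(np.max(between / (within + 1e-12)))
        # Variability: mean within-class variance over projections.
        var_scores.append(np.mean(within) / m)

    if not sep_scores:
        raise ValueError("every bootstrap sample had fewer than two classes")
    return float(np.mean(sep_scores)), float(np.mean(var_scores))
```

Under these assumptions, each resample costs one matrix product of size `sample_size × d × n_projections`, so the cost depends on the subsample size rather than on the full dataset size. Calling `projected_quality(X, y)` with an `(n, d)` feature array and a length-`n` label array returns the two estimates as floats.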

