Machine learning has proven effective in a variety of application areas, such as object and speech recognition on mobile systems. Because the availability of large training datasets is key to the success of machine learning, many datasets are being released and published online. From the point of view of a data consumer or manager, measuring data quality is an important first step in the learning process: we need to determine which datasets to use, update, and maintain. However, few practical ways to measure data quality are available today, especially for large-scale high-dimensional data such as images and videos. This paper proposes two data quality measures that compute class separability and in-class variability, two important aspects of data quality, for a given dataset. Classical data quality measures tend to focus only on class separability; we suggest that in-class variability is another important factor. We provide efficient algorithms, based on random projections and bootstrapping, that compute our quality measures with statistical benefits on large-scale high-dimensional data. In experiments, we show that our measures are compatible with classical measures on small-scale data and can be computed much more efficiently on large-scale high-dimensional datasets.
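The abstract does not spell out the algorithms, so the following is only a minimal sketch of how random projections and bootstrapping could be combined to estimate class separability. It is not the paper's method. The Fisher-style variance ratio, the choice of taking the best of several random directions, and all names (`separability_score`, `n_projections`, `n_bootstrap`) are assumptions made for illustration.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a random direction by normalizing a Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def project(points, direction):
    """Project each high-dimensional point onto a single direction."""
    return [sum(p_i * d_i for p_i, d_i in zip(p, direction)) for p in points]


def fisher_ratio(values, labels):
    """Between-class over within-class scatter of 1-D projected values."""
    overall_mean = sum(values) / len(values)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    between = 0.0
    within = 0.0
    for g in groups.values():
        m = sum(g) / len(g)
        between += len(g) * (m - overall_mean) ** 2
        within += sum((v - m) ** 2 for v in g)
    if within == 0.0:
        # Degenerate case: no spread inside any class.
        return float("inf") if between > 0.0 else 0.0
    return between / within


def separability_score(points, labels, n_projections=20, n_bootstrap=10, seed=0):
    """Hypothetical score: average, over bootstrap resamples, of the best
    Fisher ratio found among a handful of random 1-D projections."""
    rng = random.Random(seed)
    n = len(points)
    dim = len(points[0])
    scores = []
    for _ in range(n_bootstrap):
        # Bootstrap: resample the dataset with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        sample = [points[i] for i in idx]
        sample_labels = [labels[i] for i in idx]
        best = 0.0
        for _ in range(n_projections):
            direction = random_unit_vector(dim, rng)
            ratio = fisher_ratio(project(sample, direction), sample_labels)
            best = max(best, ratio)
        scores.append(best)
    return sum(scores) / len(scores)
```

Each projection costs only one pass over the data, which is the intuition behind using random directions instead of operating on the full high-dimensional space. Under this sketch, a dataset whose classes are far apart should receive a higher score than one whose classes overlap; in-class variability would need a separate measure and is not covered here.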