The quality of training data has a huge impact on the efficiency, accuracy, and complexity of machine learning tasks. Various tools and techniques are available that assess data quality through general cleaning and profiling checks. However, these techniques do not detect data issues that are specific to machine learning tasks, such as noisy labels or overlapping classes. We revisit data quality issues in the context of building a machine learning pipeline and build a tool that can detect, explain, and remediate issues in the data, and systematically and automatically capture all the changes applied to the data. We introduce the Data Quality Toolkit for machine learning, a library of key quality metrics and relevant remediation techniques for analyzing and enhancing the readiness of structured training datasets for machine learning projects. The toolkit can reduce the turnaround time of data preparation pipelines and streamline the data quality assessment process. Our toolkit is publicly available via the IBM API Hub [1] platform; any developer can assess data quality using IBM's Data Quality for AI APIs [2]. Detailed tutorials are also available on IBM Learning Path [3].