Data Quality Toolkit: Automatic assessment of data quality and remediation for machine learning datasets

Nitin Gupta, Hima Patel, Shazia Afzal, Naveen Panwar, Ruhi Sharma Mittal, Shanmukha Guttula, Abhinav Jain, Lokesh Nagalapatti, Sameep Mehta, Sandeep Hans, Pranay Lohia, Aniya Aggarwal, Diptikalyan Saha

The quality of training data has a large impact on the efficiency, accuracy, and complexity of machine learning tasks. Various tools and techniques are available that assess data quality through general cleaning and profiling checks. However, these techniques cannot detect data issues that are specific to machine learning tasks, such as noisy labels or overlapping classes. We take a fresh look at data quality issues in the context of building a machine learning pipeline and build a tool that can detect, explain, and remediate issues in the data, and that systematically and automatically captures all changes applied to the data. We introduce the Data Quality Toolkit for machine learning as a library of key quality metrics and relevant remediation techniques for analyzing and enhancing the readiness of structured training datasets for machine learning projects. The toolkit can reduce the turnaround time of data preparation pipelines and streamline the data quality assessment process. The toolkit is publicly available via the IBM API Hub [1] platform; any developer can assess data quality using IBM's Data Quality for AI APIs [2]. Detailed tutorials are also available on IBM Learning Path [3].
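To make the idea of an ML-specific quality check concrete, here is a minimal sketch of one way noisy labels might be flagged in a small structured dataset. It does not use the toolkit's API, and it is not the authors' method: it swaps in a simple nearest-neighbour disagreement heuristic. The function names `suspected_noisy_labels` and `label_purity_score`, the parameter `k`, and the sample data are all hypothetical and chosen only for illustration.

```python
from collections import Counter
import math


def suspected_noisy_labels(points, labels, k=3):
    """Return indices whose label differs from the majority label
    of their k nearest neighbours (illustrative heuristic only)."""
    flagged = []
    for i, p in enumerate(points):
        # Euclidean distance to every other sample, nearest first.
        neighbours = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )[:k]
        votes = Counter(labels[j] for _, j in neighbours)
        majority_label, _ = votes.most_common(1)[0]
        if majority_label != labels[i]:
            flagged.append(i)
    return flagged


def label_purity_score(points, labels, k=3):
    """Fraction of samples that were not flagged; 1.0 means no suspects."""
    return 1.0 - len(suspected_noisy_labels(points, labels, k)) / len(points)


# Hypothetical toy data: two well-separated clusters, with the sample
# at index 3 carrying the label of the far-away cluster.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
labels = ["a", "a", "a", "b", "b", "b", "b", "b"]

flagged = suspected_noisy_labels(points, labels)
score = label_purity_score(points, labels)
print(flagged, score)
```

A check of this kind reports both which rows look suspicious and a single score, which mirrors the detect-and-explain framing of the abstract. Remediation (for example, relabelling or dropping the flagged rows) would be a separate step that should be logged, so that changes to the data stay traceable.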

