Data exploration and quality analysis are important yet tedious steps in the AI pipeline. Current practices of data cleaning and data readiness assessment for machine learning tasks are mostly conducted in an arbitrary manner, which limits their reuse and results in lost productivity. We introduce the concept of a Data Readiness Report as accompanying documentation for a dataset that gives data consumers detailed insight into the quality of the input data. Data characteristics and challenges along various quality dimensions are identified and documented with the principles of transparency and explainability in mind. The Data Readiness Report also serves as a record of all data assessment operations, including applied transformations. This provides a detailed lineage for the purposes of data governance and management. In effect, the report captures and documents the actions taken by various personas in a data readiness and assessment workflow. Over time, this becomes a repository of best practices and can potentially drive a recommendation system for building automated data readiness workflows along the lines of AutoML [8]. We anticipate that, together with Datasheets [9], the Dataset Nutrition Label [11], FactSheets [1] and Model Cards [15], the Data Readiness Report completes the AI documentation pipeline.