Developing machine learning models can be seen as a process similar to the one established for traditional software development. A key difference between the two lies in the strong dependency between the quality of a machine learning model and the quality of the data used to train or perform evaluations. In this work, we demonstrate how different aspects of data quality propagate through various stages of machine learning development. By performing a joint analysis of the impact of well-known data quality dimensions and the downstream machine learning process, we show that different components of a typical MLOps pipeline can be efficiently designed, providing both a technical and theoretical perspective.
翻译:开发机器学习模式可被视为一个类似于传统软件开发所建立的进程,两者之间的一个关键区别在于机器学习模式的质量与用于培训或进行评价的数据的质量之间高度依赖。在这项工作中,我们展示了数据质量的不同方面如何通过机器学习发展的各个阶段传播。通过对众所周知的数据质量层面和下游机器学习过程的影响进行联合分析,我们表明,典型的MLops管道的不同组成部分可以有效地设计,提供技术和理论的视角。