In this paper, we present our vision of differentiable ML pipelines, called DiffML, to automate the construction of ML pipelines in an end-to-end fashion. The idea is that DiffML allows us to jointly train not just the ML model itself but the entire pipeline, including data preprocessing steps such as data cleaning and feature selection. Our core idea is to formulate all pipeline steps in a differentiable way such that the entire pipeline can be trained using backpropagation. However, this is a non-trivial problem and opens up many new research questions. To show the feasibility of this direction, we demonstrate initial ideas and a general principle for how typical preprocessing steps such as data cleaning, feature selection, and dataset selection can be formulated as differentiable programs and jointly learned with the ML model. Moreover, we discuss a research roadmap and the core challenges that have to be systematically tackled to enable fully differentiable ML pipelines.
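To make the core idea concrete, the following is a minimal, hypothetical sketch (not the paper's actual method) of one such differentiable preprocessing step: a soft feature-selection mask, parameterized by learnable gate logits, that is trained jointly with a linear model via backpropagation. All names and the toy data are illustrative assumptions; gradients are written out by hand to keep the example dependency-free beyond NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: only feature 0 is informative, feature 1 is pure noise.
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0]

g = np.zeros(2)  # gate logits for the feature-selection step (learned)
w = np.zeros(2)  # model weights (learned jointly with the gates)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

for _ in range(500):
    s = sigmoid(g)               # soft feature mask in (0, 1)
    Z = X * s                    # "preprocessing": gated features
    err = Z @ w - y              # prediction error of the downstream model
    # Backpropagate the loss into the model weights...
    grad_w = Z.T @ err / len(y)
    # ...and through the model into the preprocessing step's gates.
    grad_g = (err[:, None] * X * w * s * (1 - s)).mean(axis=0)
    w -= 0.1 * grad_w
    g -= 0.1 * grad_g

mask = sigmoid(g)
print(mask)  # the informative feature's gate grows past the noise feature's
```

Because the gate is differentiable, the selection decision receives gradient signal from the same loss that trains the model, so the two are optimized end to end rather than in separate pipeline stages; a hard 0/1 selection could be recovered afterwards by thresholding the learned mask.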