Machine learning in practice often involves complex pipelines for data cleansing, feature engineering, preprocessing, and prediction. These pipelines are composed of operators, which have to be correctly connected and whose hyperparameters must be correctly configured. Unfortunately, it is quite common for certain combinations of datasets, operators, or hyperparameters to cause failures. Diagnosing and fixing those failures is tedious and error-prone and can seriously derail a data scientist's workflow. This paper describes an approach for automatically debugging an ML pipeline, explaining the failures, and producing a remediation. We implemented our approach, which builds on a combination of AutoML and satisfiability modulo theories (SMT), in a tool called Maro. Maro works seamlessly with the familiar data science ecosystem including Python, Jupyter notebooks, scikit-learn, and AutoML tools such as Hyperopt. We empirically evaluate our tool and find that in most cases, a single remediation automatically fixes errors, produces no additional faults, and does not significantly affect optimal accuracy or time to convergence.
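As a concrete illustration of the kind of failure described above, the following minimal sketch builds a two-operator scikit-learn pipeline in which one hyperparameter value is incompatible with the dataset. It is not taken from the paper and does not use Maro's API; the helper name `try_fit` and the choice of the Iris dataset are assumptions made only for this example.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Iris has 150 samples and 4 features, so PCA can keep at most 4 components.
X, y = load_iris(return_X_y=True)


def try_fit(n_components):
    """Hypothetical helper: fit a PCA + logistic regression pipeline.

    Returns None on success, or the error message if fitting fails.
    """
    pipeline = make_pipeline(
        PCA(n_components=n_components),
        LogisticRegression(max_iter=1000),
    )
    try:
        pipeline.fit(X, y)
        return None
    except ValueError as exc:
        return str(exc)


# n_components=10 exceeds the number of features, so fitting fails.
failure = try_fit(10)

# A remediation in the spirit of the abstract: constrain the hyperparameter
# to a range that is valid for this dataset.
remediated = try_fit(2)

print("failing configuration:", failure)
print("remediated configuration failed?", remediated is not None)
```

The failure here depends on the dataset and the hyperparameter together: the same `n_components` value would be valid on a dataset with more features, which is why such errors are easy to hit during an automated hyperparameter search.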