Automatic machine learning, or AutoML, holds the promise of truly democratizing the use of machine learning (ML) by substantially automating the work of data scientists. However, the huge combinatorial search space of candidate pipelines means that current AutoML techniques generate sub-optimal pipelines, or none at all, especially on large, complex datasets. In this work, we propose an AutoML technique, SapientML, that can learn from a corpus of existing datasets and their human-written pipelines, and efficiently generate a high-quality pipeline for a predictive task on a new dataset. To combat the search space explosion of AutoML, SapientML employs a novel divide-and-conquer strategy, realized as a three-stage program synthesis approach that reasons on successively smaller search spaces. The first stage uses a machine-learned model to predict a set of plausible ML components to constitute a pipeline. In the second stage, this set is refined into a small pool of viable concrete pipelines using syntactic constraints derived from the corpus and the machine-learned model. In the third stage, dynamically evaluating these few pipelines yields the best solution. We instantiate SapientML as part of a fully automated tool-chain that creates a cleaned, labeled learning corpus by mining Kaggle, learns from it, and then uses the learned models to synthesize pipelines for new predictive tasks. We have created a training corpus of 1094 pipelines spanning 170 datasets, and evaluated SapientML on a set of 41 benchmark datasets, including 10 new, large, real-world datasets from Kaggle, against 3 state-of-the-art AutoML tools and 2 baselines. Our evaluation shows that SapientML produces the best or comparable accuracy on 27 of the benchmarks, while the second-best tool fails to even produce a pipeline on 9 of the instances.
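The three-stage narrowing described above can be illustrated with a minimal Python sketch. It is not SapientML's implementation: every name in it (`predict_components`, `instantiate_pipelines`, `select_best`, the meta-feature keys, and the ordering table) is hypothetical. A hand-written rule table stands in for the machine-learned model, and a lookup of made-up scores stands in for actually training and validating each candidate pipeline.

```python
# Hypothetical ordering constraint of the kind that could be mined from a
# corpus of human-written pipelines: lower rank runs earlier.
COMPONENT_ORDER = {"impute": 0, "encode": 1, "scale": 2}


def predict_components(meta):
    """Stage 1 (stand-in): map dataset meta-features to plausible components.

    A real system would use a learned model here; these rules are assumptions.
    """
    preprocessors = []
    if meta["has_numeric"]:
        preprocessors.append("scale")
    if meta["has_categorical"]:
        preprocessors.append("encode")
    if meta["has_missing"]:
        preprocessors.append("impute")
    if meta["n_rows"] < 100_000:
        models = ["logistic_regression", "random_forest"]
    else:
        models = ["gradient_boosting"]
    return preprocessors, models


def instantiate_pipelines(preprocessors, models):
    """Stage 2: turn the component set into a small pool of concrete pipelines.

    Preprocessors are put in the order the constraint table allows, and one
    candidate pipeline is produced per predicted model.
    """
    ordered = sorted(preprocessors, key=COMPONENT_ORDER.__getitem__)
    return [tuple(ordered) + (model,) for model in models]


def select_best(pipelines, evaluate):
    """Stage 3: score each of the few candidates and keep the best one."""
    return max(pipelines, key=evaluate)


# Toy run: the meta-features and scores below are invented for illustration.
meta = {"has_numeric": True, "has_categorical": False,
        "has_missing": True, "n_rows": 5_000}
preprocessors, models = predict_components(meta)
pool = instantiate_pipelines(preprocessors, models)

toy_scores = {"logistic_regression": 0.81, "random_forest": 0.86}
best = select_best(pool, lambda pipeline: toy_scores[pipeline[-1]])
print(pool)
print(best)
```

The point of the sketch is the shrinking search space: stage 1 discards most components without running anything, stage 2 leaves only as many candidates as there are predicted models, and only stage 3 pays the cost of evaluation.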