Automated machine learning (AutoML) frameworks have become important tools in the data scientist's arsenal, as they dramatically reduce the manual work devoted to constructing ML pipelines. Such frameworks intelligently search among millions of possible ML pipelines (typically containing feature engineering, model selection, and hyperparameter tuning steps) and finally output an optimal pipeline in terms of predictive accuracy. However, when the dataset is large, each individual configuration takes longer to execute, so the overall AutoML running time becomes increasingly long. To this end, we present SubStrat, an AutoML optimization strategy that tackles the data size rather than the configuration space. It wraps existing AutoML tools; instead of executing them directly on the entire dataset, SubStrat uses a genetic-based algorithm to find a small yet representative data subset that preserves a particular characteristic of the full data. It then employs the AutoML tool on the small subset, and finally refines the resulting pipeline by executing a restricted, much shorter AutoML process on the large dataset. Our experimental results, obtained on two popular AutoML frameworks, Auto-Sklearn and TPOT, show that SubStrat reduces their running times by 79% (on average), with less than a 2% average loss in the accuracy of the resulting ML pipeline.
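The abstract's core idea, a genetic search for a small data subset that preserves a characteristic of the full dataset, can be sketched as follows. This is a minimal illustration, not the paper's actual algorithm: the preserved characteristic here is assumed to be the Shannon entropy of the label distribution, and the population size, generation count, and crossover/mutation scheme are all hypothetical choices.

```python
import random
import numpy as np

def entropy(labels):
    # Shannon entropy of the label distribution (bits)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def genetic_subset(y, subset_size, pop_size=20, generations=30, seed=0):
    """Evolve row-index subsets whose label entropy matches the full data.

    A toy stand-in for SubStrat's subset search: fitness rewards subsets
    whose entropy is close to that of the entire label vector y.
    """
    rng = random.Random(seed)
    n = len(y)
    target = entropy(y)
    fitness = lambda idx: -abs(entropy(y[list(idx)]) - target)

    # Initial population: random subsets of row indices.
    pop = [rng.sample(range(n), subset_size) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]          # selection: keep top half
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            # Crossover: mix two parents, dedupe, top up with random rows.
            child = list(dict.fromkeys(a[: subset_size // 2] + b))[:subset_size]
            while len(child) < subset_size:
                c = rng.randrange(n)
                if c not in child:
                    child.append(c)
            # Mutation: occasionally swap in a random unused row.
            if rng.random() < 0.3:
                cand = rng.randrange(n)
                if cand not in child:
                    child[rng.randrange(subset_size)] = cand
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)
```

The returned index list would then be used to slice the full dataset before handing it to an AutoML tool such as Auto-Sklearn or TPOT, with a short restricted AutoML pass on the full data to refine the resulting pipeline.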