As big data becomes ubiquitous across domains, and more and more stakeholders aspire to make the most of their data, demand for machine learning tools has spurred researchers to explore the possibilities of automated machine learning (AutoML). AutoML tools aim to make machine learning accessible for non-machine-learning experts (domain experts), to improve the efficiency of machine learning, and to accelerate machine learning research. But although automation and efficiency are among AutoML's main selling points, the process still requires human involvement at a number of vital steps, including understanding the attributes of domain-specific data, defining prediction problems, creating a suitable training data set, and selecting a promising machine learning technique. These steps often require a prolonged back-and-forth that makes this process inefficient for domain experts and data scientists alike, and keeps so-called AutoML systems from being truly automatic. In this review article, we introduce a new classification system for AutoML systems, using a seven-tiered schematic to distinguish these systems based on their level of autonomy. We begin by describing what an end-to-end machine learning pipeline actually looks like, and which subtasks of the machine learning pipeline have been automated so far. We highlight those subtasks that are still done manually, generally by a data scientist, and explain how this limits domain experts' access to machine learning. Next, we introduce our novel level-based taxonomy for AutoML systems and define each level according to the scope of automation support provided. Finally, we lay out a roadmap for the future, pinpointing the research required to further automate the end-to-end machine learning pipeline and discussing important challenges that stand in the way of this ambitious goal.