Data science (DS) projects often follow a lifecycle of laborious tasks for data scientists and domain experts (e.g., data exploration and model training). Only recently have machine learning (ML) researchers developed promising automation techniques to aid data workers in these tasks. This paper introduces AutoDS, an automated machine learning (AutoML) system that aims to leverage the latest ML automation techniques to support data science projects. Data workers only need to upload their dataset; the system can then automatically suggest ML configurations, preprocess the data, select algorithms, and train the model. These suggestions are presented to the user via a web-based graphical user interface and a notebook-based programming user interface. We studied AutoDS with 30 professional data scientists, where one group used AutoDS and the other did not, to complete a data science project. As expected, AutoDS improves productivity; yet surprisingly, we find that the models produced by the AutoDS group have higher quality and fewer errors, but lower human confidence scores. We reflect on the findings by presenting design implications for incorporating automation techniques into human work in the data science lifecycle.