Orchestrating a high-quality data preparation program is essential for successful machine learning (ML), but it is known to be time-consuming and labor-intensive. Large language models such as ChatGPT show impressive capabilities in generating programs by interacting with users through natural language prompts, but limitations remain. Specifically, a user must provide specific prompts to iteratively guide ChatGPT in improving a data preparation program, which requires a certain level of expertise in programming, the dataset used, and the ML task. Moreover, once a program has been generated, it is non-trivial to revisit a previous version or to make changes to the program without starting the process over again. In this paper, we present ChatPipe, a novel system designed to facilitate seamless interaction between users and ChatGPT. ChatPipe provides users with effective recommendations for the next data preparation operations and guides ChatGPT to generate programs for those operations. ChatPipe also enables users to easily roll back to previous versions of the program, which facilitates more efficient experimentation and testing. We have developed a web application for ChatPipe and prepared several real-world ML tasks from Kaggle. These tasks showcase the capabilities of ChatPipe and enable VLDB attendees to easily experiment with our novel features to rapidly orchestrate a high-quality data preparation program.
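The rollback capability described above can be pictured as an append-only history of program versions with a movable "current" pointer. The following is a minimal sketch under that assumption only: the abstract does not describe ChatPipe's internals, so the class names (`ProgramVersion`, `ProgramHistory`), the method names, and the example operation labels are all hypothetical and are not taken from the system itself.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProgramVersion:
    """One snapshot of a data preparation program (hypothetical structure)."""
    version_id: int
    parent_id: Optional[int]  # version this snapshot was derived from
    operation: str            # label of the operation that produced it
    code: str                 # full program text after the operation


@dataclass
class ProgramHistory:
    """Append-only store of versions plus a pointer to the active one."""
    versions: List[ProgramVersion] = field(default_factory=list)
    current_id: Optional[int] = None

    def commit(self, operation: str, code: str) -> int:
        """Record a new version derived from the current one."""
        new_id = len(self.versions)
        self.versions.append(
            ProgramVersion(new_id, self.current_id, operation, code)
        )
        self.current_id = new_id
        return new_id

    def rollback(self, version_id: int) -> str:
        """Make an earlier version current and return its program text.

        Later versions are kept, so no work is lost by rolling back.
        """
        if not 0 <= version_id < len(self.versions):
            raise ValueError(f"unknown version: {version_id}")
        self.current_id = version_id
        return self.versions[version_id].code

    def lineage(self) -> List[str]:
        """Operations on the path from the first version to the current one."""
        ops: List[str] = []
        vid = self.current_id
        while vid is not None:
            version = self.versions[vid]
            ops.append(version.operation)
            vid = version.parent_id
        return list(reversed(ops))


# Example session (operation labels and program text are illustrative).
history = ProgramHistory()
v0 = history.commit("load_csv", "df = pd.read_csv('train.csv')")
v1 = history.commit("impute_missing", "df = df.fillna(df.median())")
v2 = history.commit("one_hot_encode", "df = pd.get_dummies(df)")
history.rollback(v1)  # discard nothing, just step back to v1
v3 = history.commit("standard_scale", "df = (df - df.mean()) / df.std()")
```

Because each version stores its parent, committing after a rollback creates a branch rather than overwriting history: in the session above, `history.lineage()` follows `load_csv`, `impute_missing`, `standard_scale`, while the `one_hot_encode` version remains available for a later rollback.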