Recent work has made significant progress in helping users automate single data preparation steps, such as string transformations and table-manipulation operators (e.g., Join, GroupBy, Pivot). In this work, we propose to automate multiple such steps end-to-end by synthesizing complex data pipelines that combine string transformations and table-manipulation operators. We propose a novel "by-target" paradigm that allows users to easily specify the desired pipeline, a significant departure from the traditional by-example paradigm. With by-target, users provide input tables (e.g., CSV or JSON files) and point us to a "target table" (e.g., an existing database table or BI dashboard) that demonstrates what the output of the desired pipeline should schematically look like. While the problem is seemingly underspecified, our key insight is that implicit table constraints, such as functional dependencies (FDs) and keys, can be exploited to significantly constrain the search space and make the problem tractable. We develop an Auto-Pipeline system that learns to synthesize pipelines using reinforcement learning and search. Experiments on a large number of real pipelines crawled from GitHub suggest that Auto-Pipeline can successfully synthesize 60-70% of these complex pipelines (up to 10 steps) in 10-20 seconds on average.
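To make the role of implicit constraints concrete, the following is a minimal sketch of how keys and single-column FDs observed in a target table could be used to reject a candidate pipeline output. It is an illustration under stated assumptions, not the Auto-Pipeline implementation: the function names, the table representation (lists of dictionaries), and the example columns are all hypothetical, and constraints inferred from a small sample may hold only by coincidence.

```python
from itertools import permutations

def key_columns(rows, columns):
    """Return columns whose values are unique across all rows (single-column keys)."""
    return {c for c in columns if len({r[c] for r in rows}) == len(rows)}

def functional_dependencies(rows, columns):
    """Return single-column FDs (a, b), meaning each value of a maps to one value of b."""
    fds = set()
    for a, b in permutations(columns, 2):
        seen = {}
        # setdefault records the first b-value seen for each a-value;
        # any later disagreement violates the dependency a -> b.
        if all(seen.setdefault(r[a], r[b]) == r[b] for r in rows):
            fds.add((a, b))
    return fds

def satisfies_target_constraints(candidate, target, columns):
    """Check that every key and FD observed in the target also holds in the candidate."""
    return (key_columns(target, columns) <= key_columns(candidate, columns)
            and functional_dependencies(target, columns)
            <= functional_dependencies(candidate, columns))

# Hypothetical schema and data, used only for illustration.
COLUMNS = ["order_id", "customer_id", "customer_name"]

target = [
    {"order_id": 1, "customer_id": "c1", "customer_name": "Ann"},
    {"order_id": 2, "customer_id": "c1", "customer_name": "Ann"},
    {"order_id": 3, "customer_id": "c2", "customer_name": "Bob"},
]

# Output of a plausible pipeline: different values, same constraints.
good_candidate = [
    {"order_id": 10, "customer_id": "c7", "customer_name": "Eve"},
    {"order_id": 11, "customer_id": "c8", "customer_name": "Dan"},
]

# Output of a wrong join that repeats an order, so order_id is no longer a key.
bad_candidate = [
    {"order_id": 10, "customer_id": "c7", "customer_name": "Eve"},
    {"order_id": 10, "customer_id": "c8", "customer_name": "Dan"},
]

print(satisfies_target_constraints(good_candidate, target, COLUMNS))
print(satisfies_target_constraints(bad_candidate, target, COLUMNS))
```

The candidate tables need not share any values with the target. Only the schema-level constraints are compared, which is what allows a target table to prune the space of pipelines without per-row examples.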