Data workers often need to understand the semantics of data wrangling scripts in scenarios such as debugging, reusing, and maintaining code. However, this understanding is challenging for novice data workers because of the variety of programming languages, functions, and parameters involved. Based on the observation that the differences between input and output tables are closely related to the type of data transformation, we outline a design space of 103 characteristics that describe table differences. We then develop COMANTICS, a three-step pipeline that automatically detects the semantics of data transformation scripts. First, it detects the table differences produced by each line of wrangling code. Second, it combines a characteristic-based component with a Siamese convolutional neural network-based component to detect transformation types. Third, it derives the parameters of each data transformation using a "slot filling" strategy. We design experiments to evaluate the performance of COMANTICS and further assess its flexibility through three example applications in different domains.
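To make the idea of inferring a transformation from table differences concrete, the following is a minimal sketch and not the COMANTICS implementation. It assumes tables are plain lists of row dictionaries, uses only three hand-picked difference characteristics (the design space in the paper has 103), and replaces the characteristic-based and Siamese network components with a few hard-coded rules. All function names, transformation labels, and parameter slots here are hypothetical.

```python
from typing import Any

# Assumption: a table is a list of rows, each row a dict of column -> value.
Table = list[dict[str, Any]]


def columns_of(table: Table) -> list[str]:
    """Return column names in first-seen order."""
    seen: dict[str, None] = {}
    for row in table:
        for name in row:
            seen.setdefault(name, None)
    return list(seen)


def table_diff(before: Table, after: Table) -> dict[str, Any]:
    """Step 1 (sketch): describe how the output table differs from the input."""
    cols_before, cols_after = columns_of(before), columns_of(after)
    return {
        "row_delta": len(after) - len(before),
        "columns_added": [c for c in cols_after if c not in cols_before],
        "columns_removed": [c for c in cols_before if c not in cols_after],
    }


def guess_transformation(diff: dict[str, Any]) -> dict[str, Any]:
    """Steps 2 and 3 (sketch): pick a type by rule, then fill its parameter slots."""
    added = diff["columns_added"]
    removed = diff["columns_removed"]
    row_delta = diff["row_delta"]
    if removed and not added and row_delta == 0:
        # Slot "columns" is filled directly from the detected difference.
        return {"type": "drop_columns", "params": {"columns": removed}}
    if added and not removed and row_delta == 0:
        return {"type": "create_columns", "params": {"columns": added}}
    if not added and not removed and row_delta < 0:
        return {"type": "delete_rows", "params": {"count": -row_delta}}
    # Anything the toy rules do not cover is left unclassified.
    return {"type": "unknown", "params": {}}


before = [
    {"name": "Ann", "age": 31, "city": "Oslo"},
    {"name": "Bo", "age": 27, "city": "Rome"},
]
after = [
    {"name": "Ann", "age": 31},
    {"name": "Bo", "age": 27},
]
result = guess_transformation(table_diff(before, after))
print(result)
```

In this toy example the only difference between the two tables is the missing `city` column, so the rules label the step as a column drop and fill the `columns` slot with `["city"]`. A realistic detector would need many more characteristics, since different transformations can produce the same coarse differences.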