Data preprocessing is a crucial stage in the data analysis pipeline, with both technical and social aspects to consider. Yet it often receives little attention in research practice and dissemination. We present the Smallset Timeline, a visualisation to help practitioners reflect on and communicate data preprocessing decisions. A "Smallset" is a small selection of rows from the original dataset containing instances of dataset alterations. The Timeline is composed of Smallset snapshots representing different points in the preprocessing stage, with captions describing the alterations visualised at each point. Edits, additions, and deletions to the dataset are highlighted with colour. We develop an R software package, smallsets, that creates Smallset Timelines from R and Python data preprocessing scripts. Constructing the figure asks practitioners to reflect on and revise decisions as necessary, while sharing it aims to make the process accessible to a diverse range of audiences. We present two case studies to illustrate use of the Smallset Timeline for visualising preprocessing decisions: software defect data, in which preprocessing affects the level of data loss, and income survey benchmark data, in which it affects group fairness in prediction tasks. We envision Smallset Timelines as a go-to data provenance tool, enabling better documentation and communication of preprocessing tasks at large.
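To make the idea of a snapshot comparison concrete, the following is a minimal sketch of how the changes between two Smallset snapshots might be classified before they are coloured. It is not the smallsets package API. The function name `diff_snapshots`, the dictionary-based snapshot representation, and the example data are all assumptions made for illustration, and the sketch covers only deleted rows, added columns, and edited cells.

```python
# Illustrative sketch only: this is NOT the smallsets API.
# All names and data below are hypothetical.
from typing import Any

Row = dict[str, Any]
Snapshot = dict[int, Row]  # row id -> {column name: value}


def diff_snapshots(before: Snapshot, after: Snapshot) -> dict[str, list]:
    """Classify changes between two snapshots of the same small row selection."""
    changes: dict[str, list] = {
        "deleted_rows": [],
        "added_columns": [],
        "edited_cells": [],
    }

    # Rows present before the step but missing afterwards were deleted.
    changes["deleted_rows"] = sorted(set(before) - set(after))

    before_cols = {col for row in before.values() for col in row}
    after_cols = {col for row in after.values() for col in row}
    # Columns that exist only after the step were added.
    changes["added_columns"] = sorted(after_cols - before_cols)

    # In surviving rows, a cell in a pre-existing column whose value
    # differs between the two snapshots was edited.
    for row_id in sorted(set(before) & set(after)):
        for col in sorted(before_cols & after_cols):
            if before[row_id].get(col) != after[row_id].get(col):
                changes["edited_cells"].append((row_id, col))
    return changes


# Hypothetical Smallset: three rows chosen from a larger dataset.
before = {
    1: {"age": 34, "income": 52000},
    2: {"age": None, "income": 61000},
    3: {"age": 29, "income": -1},
}
# After a made-up preprocessing step: row 3 is dropped, a missing age is
# imputed, and a new column is derived.
after = {
    1: {"age": 34, "income": 52000, "high_income": True},
    2: {"age": 31, "income": 61000, "high_income": True},
}

changes = diff_snapshots(before, after)
print(changes)
```

Each category in the returned dictionary would correspond to one highlight colour in a snapshot, and the caption for that point in the Timeline would explain why the change was made.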