Data leakage is a well-known problem in machine learning. It occurs when information from outside the training dataset is used to create a model. This phenomenon renders a model excessively optimistic, or even useless in the real world, since the model tends to rely heavily on the unfairly acquired information. To date, data leakage is detected post-mortem, using run-time methods. However, due to its insidious nature, it may not even be apparent to a data scientist that data leakage has occurred in the first place. For this reason, it is advantageous to detect data leakage as early as possible in the development life cycle. In this paper, we propose a novel static analysis that detects several instances of data leakage during development time. We define our analysis within the framework of abstract interpretation: we give a concrete semantics that is sound and complete, from which we derive a sound and computable abstract semantics. We implement our static analysis in the open-source NBLyzer static analysis framework and demonstrate its utility by evaluating its performance and precision on over 2000 Kaggle competition notebooks.
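To make the phenomenon concrete, the following is a minimal sketch (not taken from the paper, and not its analysis) of one common leakage pattern in notebook code: a scaler is fit on the full dataset before the train/test split, so statistics of the test rows leak into the training features; the leak-free variant fits the scaler on the training split only. All variable names are illustrative.

    # Minimal sketch of a common data leakage pattern (illustrative only).
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = (X[:, 0] > 0).astype(int)

    # Leaky: normalization is computed over all rows, including the
    # rows that will later form the test set.
    X_scaled = StandardScaler().fit_transform(X)
    X_tr, X_te, y_tr, y_te = train_test_split(X_scaled, y, random_state=0)
    leaky_score = LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)

    # Leak-free: split first, fit the scaler on training data only,
    # then apply the fitted scaler to the held-out test data.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    scaler = StandardScaler().fit(X_tr)
    clean_score = (LogisticRegression()
                   .fit(scaler.transform(X_tr), y_tr)
                   .score(scaler.transform(X_te), y_te))

A static analysis of the kind described above would aim to flag the first variant at development time, before the model is ever trained or evaluated.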