Data science pipelines that train and evaluate machine learning models may contain bugs just like any other code. Leakage between training and test data can lead to overestimating a model's accuracy during offline evaluation, possibly resulting in the deployment of low-quality models in production. Such leakage can easily arise by mistake or by following poor practices, yet it is tedious and challenging to detect manually. We develop a static analysis approach to detect common forms of data leakage in data science code. Our evaluation shows that our analysis accurately detects data leakage and that such leakage is pervasive among over 100,000 analyzed public notebooks. We discuss how our static analysis approach can help both practitioners and educators, and how leakage prevention can be designed into the development process.
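To make the pattern concrete, the sketch below is our own illustration (not code from the paper's analysis or dataset) of one common leakage form in scikit-learn: fitting a preprocessing step on the full dataset before splitting, so that test-set statistics influence the training data, followed by the leak-free ordering.

```python
# Illustrative sketch of preprocessing leakage, using scikit-learn.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Leaky: the scaler computes mean/std over ALL rows, including
# the future test set, before the split happens.
X_scaled = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X_scaled, y, random_state=0)
leaky_acc = LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)

# Leak-free: split first, then fit the scaler on training rows only
# and apply the fitted transform to both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_tr)
clean_acc = (LogisticRegression()
             .fit(scaler.transform(X_tr), y_tr)
             .score(scaler.transform(X_te), y_te))

print(f"leaky accuracy: {leaky_acc:.3f}, leak-free accuracy: {clean_acc:.3f}")
```

A static analysis of the kind the abstract describes can flag the first ordering by tracking data flow: the value returned by `fit_transform` on the full dataset reaches both the training and evaluation partitions.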