Subtle Data Crimes: Naively training machine learning algorithms could lead to overly-optimistic results

While open databases are an important resource in the Deep Learning (DL) era, they are sometimes used "off-label": data published for one task are used to train algorithms for a different one. This work aims to highlight that in some cases, this common practice may lead to biased, overly-optimistic results. We demonstrate this phenomenon for inverse problem solvers and show how their biased performance stems from hidden data preprocessing pipelines. We describe two preprocessing pipelines typical of open-access databases and study their effects on three well-established algorithms developed for Magnetic Resonance Imaging (MRI) reconstruction: Compressed Sensing (CS), Dictionary Learning (DictL), and DL. In this large-scale study we performed extensive computations. Our results demonstrate that the CS, DictL and DL algorithms yield systematically biased results when naively trained on seemingly-appropriate data: the Normalized Root Mean Square Error (NRMSE) improves consistently with the preprocessing extent, showing an artificial improvement of 25%-48% in some cases. Since this phenomenon is generally unknown, biased results are sometimes published as state-of-the-art; we refer to that as subtle data crimes. This work hence raises a red flag regarding naive off-label usage of Big Data and reveals the vulnerability of modern inverse problem solvers to the resulting bias.
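For readers unfamiliar with the metric, the following is a minimal sketch of how NRMSE and a relative improvement figure could be computed. It assumes one common convention, the l2 norm of the error divided by the l2 norm of the reference; the abstract does not state which normalization the authors use, nor exactly how the 25%-48% figure is derived. The function names `nrmse` and `relative_improvement` are hypothetical and chosen only for illustration.

```python
import numpy as np


def nrmse(reference, reconstruction):
    """NRMSE under an assumed convention: ||recon - ref||_2 / ||ref||_2.

    Works for real or complex arrays of any shape; with no `ord` or `axis`
    given, np.linalg.norm returns the 2-norm of the flattened array.
    """
    reference = np.asarray(reference)
    reconstruction = np.asarray(reconstruction)
    return np.linalg.norm(reconstruction - reference) / np.linalg.norm(reference)


def relative_improvement(nrmse_baseline, nrmse_processed):
    """Fractional drop in NRMSE relative to the baseline (positive = lower error).

    Assumption: this is one plausible way to express an "improvement" in
    percent; it is not confirmed to be the authors' definition.
    """
    return (nrmse_baseline - nrmse_processed) / nrmse_baseline
```

Under this convention a lower NRMSE is better, so an apparent gain that grows with the amount of preprocessing applied to the data, rather than with any change to the algorithm, is the kind of bias the abstract describes.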
