Medical datasets are particularly prone to attribute noise, that is, missing and erroneous values. Attribute noise is known to be largely detrimental to learning performance. To maximize future learning performance, it is essential to deal with attribute noise before any inference. We propose a simple autoencoder-based preprocessing method that can correct mixed-type tabular data corrupted by attribute noise. No other method currently exists to handle attribute noise in tabular data. We experimentally demonstrate that our method outperforms both state-of-the-art imputation methods and noise correction methods on several real-world medical datasets.
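To make the notion of attribute noise concrete, the following minimal sketch corrupts a small mixed-type table with both missing and erroneous values. It illustrates only the noise model described above, not the proposed correction method. The function name `add_attribute_noise`, the toy records, and the noise rates are hypothetical and chosen purely for illustration.

```python
import random

def add_attribute_noise(rows, missing_rate=0.1, error_rate=0.1, seed=0):
    """Return a corrupted copy of `rows` (a list of dicts) and a per-cell mask.

    Hypothetical helper: each cell independently becomes missing (None),
    erroneous (replaced by a different value observed in the same column),
    or stays clean.
    """
    rng = random.Random(seed)
    columns = list(rows[0].keys())
    # Values observed in each column, used to draw plausible wrong values.
    pools = {c: [r[c] for r in rows] for c in columns}
    noisy, mask = [], []
    for row in rows:
        new_row, row_mask = {}, {}
        for c in columns:
            u = rng.random()
            # Only values that differ from the true one count as errors.
            candidates = [v for v in pools[c] if v != row[c]]
            if u < missing_rate:
                new_row[c], row_mask[c] = None, "missing"
            elif u < missing_rate + error_rate and candidates:
                new_row[c], row_mask[c] = rng.choice(candidates), "erroneous"
            else:
                new_row[c], row_mask[c] = row[c], "clean"
        noisy.append(new_row)
        mask.append(row_mask)
    return noisy, mask

# Toy mixed-type records (made-up values, not taken from any real dataset).
rows = [
    {"age": 54, "sex": "F", "glucose": 5.4},
    {"age": 61, "sex": "M", "glucose": 7.9},
    {"age": 38, "sex": "F", "glucose": 4.8},
    {"age": 70, "sex": "M", "glucose": 6.3},
]

noisy, mask = add_attribute_noise(rows, missing_rate=0.2, error_rate=0.2, seed=42)
for original, corrupted in zip(rows, noisy):
    print(original, "->", corrupted)
```

The returned mask records which cells were altered. In an experimental setting, such a mask would allow a correction method to be scored against the known clean values.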