Missing values are a common problem in data science and machine learning. Removing instances with missing values can adversely affect the quality of further data analysis. The problem is exacerbated when there are many more features than instances, so that the proportion of affected instances is high. Such a scenario is common in many important domains; for example, single nucleotide polymorphism (SNP) datasets provide a large number of features over a genome for a relatively small number of individuals. To preserve as much information as possible prior to modeling, a rigorous imputation scheme is acutely needed. While Denoising Autoencoders are a state-of-the-art method for imputation in high-dimensional data, they still require enough complete cases to be trained on, and such cases are often not available in real-world problems. In this paper, we consider missing value imputation as a multi-label classification problem and propose Chains of Autoreplicative Random Forests. Using multi-label Random Forests instead of neural networks works well for data with few samples, as there are fewer parameters to optimize. Experiments on several SNP datasets show that our algorithm effectively imputes missing values based only on information from the dataset and performs better than standard algorithms that do not require any additional information. In this paper, the algorithm is implemented specifically for SNP data, but it can easily be adapted for other cases of missing value imputation.
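To make the multi-label framing concrete, the following is a minimal sketch of one possible reading of an autoreplicative forest. It is not the algorithm proposed in the paper, and it omits the chaining step entirely. It assumes genotypes coded as the integers 0/1/2, a hypothetical missing-value code of `-1`, and scikit-learn's `RandomForestClassifier`, which accepts a two-dimensional label matrix. The provisional mode fill, the extra random masking of observed entries, and all names and parameter values (`autoreplicative_impute`, `extra_mask_rate`, `min_samples_leaf=3`) are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

MISSING = -1  # hypothetical code for a missing genotype call (assumption)


def autoreplicative_impute(X, extra_mask_rate=0.2, n_estimators=100, seed=0):
    """Fill MISSING entries of an integer-coded matrix (rows = individuals).

    Every column is treated as both an input feature and an output label,
    so a single multi-output forest learns to reproduce the whole row.
    """
    X = np.asarray(X)
    rng = np.random.default_rng(seed)
    missing = X == MISSING

    # Label matrices must be complete, so gaps in the targets are filled
    # provisionally with the most frequent observed value of each column.
    target = X.copy()
    for j in range(X.shape[1]):
        observed = X[~missing[:, j], j]
        if observed.size == 0:
            raise ValueError(f"column {j} has no observed values")
        values, counts = np.unique(observed, return_counts=True)
        target[missing[:, j], j] = values[np.argmax(counts)]

    # Inputs keep the MISSING code and additionally hide a random share of
    # observed entries, so the forest sees hidden entries with known labels.
    corrupted = X.copy()
    corrupted[rng.random(X.shape) < extra_mask_rate] = MISSING

    forest = RandomForestClassifier(
        n_estimators=n_estimators,
        min_samples_leaf=3,  # leaves pool several rows instead of memorizing one
        random_state=seed,
    )
    forest.fit(corrupted, target)  # 2-D target: one label per column

    # Predict every column at once, then overwrite only the original gaps.
    predicted = forest.predict(X)
    imputed = X.copy()
    imputed[missing] = predicted[missing]
    return imputed


# Toy data (synthetic, for illustration only): 40 individuals, 6 SNPs,
# with column 1 duplicating column 0 so that some structure is learnable.
rng = np.random.default_rng(1)
genotypes = rng.integers(0, 3, size=(40, 6))
genotypes[:, 1] = genotypes[:, 0]
with_gaps = genotypes.copy()
with_gaps[rng.random(genotypes.shape) < 0.1] = MISSING

imputed = autoreplicative_impute(with_gaps)
```

The sketch uses only the dataset itself, with no reference panel or other side information, which mirrors the setting described in the abstract. Observed entries are never modified, and every filled value is drawn from the values already seen in that column.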