Class imbalance is a common issue in many application domains of learning algorithms. Often, in these same domains, it is far more important to correctly classify and profile minority-class observations. Feature Selection (FS) can address this need and offers further advantages, such as lower computational cost and easier inference and interpretability. However, traditional FS techniques may become sub-optimal in the presence of strongly imbalanced data. To retain the advantages of FS in this setting, we propose a filter-based FS algorithm that ranks feature importance on the basis of the Reconstruction Error of a Deep Sparse AutoEncoders Ensemble (DSAEE). Each DSAE is trained only on the majority class and is then used to reconstruct both classes. By analyzing the aggregated Reconstruction Error, we determine the features on which the minority class has a different distribution of values with respect to the overrepresented class, thus identifying the features most relevant for discriminating between the two. We empirically demonstrate the efficacy of our algorithm in several experiments on high-dimensional datasets of varying sample size, showcasing its capability to select relevant and generalizable features for profiling and classifying the minority class, and outperforming other benchmark FS methods. We also briefly present a real application in radiogenomics, where the methodology was applied successfully.
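The ranking idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: for brevity it assumes a small ensemble of single-hidden-layer (not deep) autoencoders written in NumPy, an L1 penalty on hidden activations as the sparsity term, and the mean per-feature difference in squared reconstruction error between the two classes as the aggregation rule. The function names (`train_sparse_autoencoder`, `reconstruct`, `rank_features`), the hyperparameters, and the synthetic data are all hypothetical.

```python
import numpy as np

def train_sparse_autoencoder(X, n_hidden, rng, l1=1e-3, lr=0.01, epochs=200):
    """Fit a one-hidden-layer tanh autoencoder by full-batch gradient descent.
    The L1 penalty on hidden activations is an assumed sparsity term."""
    n, d = X.shape
    W1 = rng.normal(0.0, 0.1, (d, n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.1, (n_hidden, d)); b2 = np.zeros(d)
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)              # hidden code
        R = H @ W2 + b2                       # reconstruction
        dR = 2.0 * (R - X) / n                # gradient of the squared error, averaged over samples
        dH = dR @ W2.T + l1 * np.sign(H) / n  # backprop plus sparsity gradient
        dZ = dH * (1.0 - H ** 2)              # derivative of tanh
        W2 -= lr * (H.T @ dR); b2 -= lr * dR.sum(axis=0)
        W1 -= lr * (X.T @ dZ); b1 -= lr * dZ.sum(axis=0)
    return W1, b1, W2, b2

def reconstruct(params, X):
    W1, b1, W2, b2 = params
    return np.tanh(X @ W1 + b1) @ W2 + b2

def rank_features(X_major, X_minor, n_models=5, n_hidden=4, seed=0):
    """Rank features by how much worse the minority class is reconstructed
    by autoencoders trained on the majority class only."""
    rng = np.random.default_rng(seed)
    # Standardize both classes with majority-class statistics only.
    mu, sd = X_major.mean(axis=0), X_major.std(axis=0) + 1e-12
    Z_major, Z_minor = (X_major - mu) / sd, (X_minor - mu) / sd
    gap = np.zeros(X_major.shape[1])
    for _ in range(n_models):
        params = train_sparse_autoencoder(Z_major, n_hidden, rng)
        err_major = ((reconstruct(params, Z_major) - Z_major) ** 2).mean(axis=0)
        err_minor = ((reconstruct(params, Z_minor) - Z_minor) ** 2).mean(axis=0)
        gap += err_minor - err_major          # per-feature error gap
    gap /= n_models                           # aggregate over the ensemble (assumed: mean)
    return np.argsort(-gap), gap              # largest gap first

# Synthetic example: the minority class differs only on features 0, 1 and 2.
data_rng = np.random.default_rng(42)
X_major = data_rng.normal(size=(500, 20))
X_minor = data_rng.normal(size=(40, 20))
X_minor[:, :3] += 3.0
order, gap = rank_features(X_major, X_minor)
print("top-ranked features:", order[:3])
```

Features whose minority-class values fall outside the range seen during training are reconstructed poorly, so they receive the largest error gap and rise to the top of the ranking. A faithful reproduction would replace the shallow model with deep sparse autoencoders and use the aggregation scheme defined in the paper.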