In recent years, deep learning has attracted interest in almost all disciplines due to its impressive empirical success in analyzing complex data sets, such as imaging, genetics, climate, and medical data. While most of these models are treated as black boxes, there is increasing interest in interpretable, reliable, and robust deep learning models applicable to a broad class of applications. Deep learning with feature selection has proven promising in this regard. However, recent developments do not address feature selection when the predictors are ultra-high dimensional and highly correlated and the noise level is high. In this article, we propose a novel screening and cleaning strategy, aided by deep learning, for the cluster-level discovery of highly correlated predictors with a controlled error rate. A thorough empirical evaluation over a wide range of simulated scenarios demonstrates the effectiveness of the proposed method, which achieves high power with a minimal number of false discoveries. Furthermore, we applied the algorithm to the riboflavin (vitamin $B_2$) production dataset to investigate possible genetic associations with riboflavin production. The proposed methodology achieves lower prediction error than other state-of-the-art methods, which illustrates its gain.