Deep neural networks (DNNs) have been used successfully in many scientific problems for their high prediction accuracy, but their application to genetic studies remains challenging due to their poor interpretability. In this paper, we consider the problem of scalable, robust variable selection in DNNs for the identification of putative causal genetic variants in genome sequencing studies. We identified pronounced randomness in feature selection in DNNs due to their stochastic nature, which may hinder interpretability and give rise to misleading results. We propose an interpretable neural network model, stabilized using ensembling, with controlled variable selection for genetic studies. The merits of the proposed method include: (1) flexible modelling of the non-linear effects of genetic variants to improve statistical power; (2) multiple knockoffs in the input layer to rigorously control the false discovery rate; (3) hierarchical layers to substantially reduce the number of weight parameters and activations and thereby improve computational efficiency; (4) de-randomized feature selection to stabilize identified signals. We evaluated the proposed method in extensive simulation studies and applied it to the analysis of Alzheimer's disease genetics. We showed that the proposed method, when compared to conventional linear and nonlinear methods, can lead to substantially more discoveries.
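To make merits (2) and (4) concrete, the following is a minimal sketch of the standard single-knockoff "knockoff+" selection rule combined with simple averaging over repeated runs. It is an illustration under stated assumptions, not the multiple-knockoff procedure or the ensembling scheme of the proposed method. The function names, and the use of a plain mean to combine runs, are hypothetical choices made for this example. Each statistic `w[j]` is assumed to be the importance of feature `j` minus the importance of its knockoff copy, so that large positive values suggest a true signal.

```python
import math


def knockoff_plus_threshold(w, q):
    """Return the smallest t > 0 whose estimated false discovery
    proportion is at most q, or math.inf if no such t exists.

    w: per-feature statistics (original importance minus knockoff
       importance); this definition is an assumption of the sketch.
    q: target false discovery rate level.
    """
    # Candidate thresholds are the distinct nonzero magnitudes of w.
    candidates = sorted({abs(x) for x in w if x != 0})
    for t in candidates:
        n_neg = sum(1 for x in w if x <= -t)  # proxy for false positives
        n_pos = sum(1 for x in w if x >= t)   # features that would be selected
        if (1 + n_neg) / max(1, n_pos) <= q:
            return t
    return math.inf


def select_features(w, q):
    """Indices of the features whose statistic reaches the threshold."""
    tau = knockoff_plus_threshold(w, q)
    return [j for j, x in enumerate(w) if x >= tau]


def ensemble_statistics(w_runs):
    """Average the statistics feature by feature over several runs.

    w_runs: list of runs, each a list with one statistic per feature.
    A plain mean is a hypothetical stand-in for de-randomization.
    """
    n_runs = len(w_runs)
    return [sum(column) / n_runs for column in zip(*w_runs)]


# Toy usage: two hypothetical training runs over four features.
runs = [[3.0, -0.5, 2.0, 0.1], [2.0, 0.5, 2.5, -0.3]]
averaged = ensemble_statistics(runs)
selected = select_features(averaged, q=0.5)
```

Averaging the statistics before thresholding means that a feature favoured in only one run contributes less to the final selection, which is the intuition behind stabilizing the identified signals.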