This paper develops a new approach to post-selection inference for screening high-dimensional predictors of survival outcomes. Post-selection inference for right-censored outcome data has been investigated in the literature, but much remains to be done to make the methods both reliable and computationally scalable in high dimensions. Machine learning tools are commonly used to provide {\it predictions} of survival outcomes, but the estimated effect of a selected predictor suffers from confirmation bias unless the selection is taken into account. The new approach involves the construction of semi-parametrically efficient estimators of the linear association between the predictors and the survival outcome, which are used to build a test statistic for detecting the presence of an association between any of the predictors and the outcome. Further, a stabilization technique reminiscent of bagging allows a normal calibration for the resulting test statistic, which enables the construction of confidence intervals for the maximal association between predictors and the outcome and also greatly reduces computational cost. Theoretical results show that this testing procedure is valid even when the number of predictors grows superpolynomially with sample size, and our simulations suggest that this asymptotic guarantee is indicative of the performance of the test at moderate sample sizes. The new approach is applied to the problem of identifying patterns in viral gene expression associated with the potency of an antiviral drug.