Identifying the relevant variables for a classification model with correct confidence levels is a central but difficult task in high dimensions. Despite the core role of sparse logistic regression in statistics and machine learning, it still lacks a good solution for accurate inference in the regime where the number of features $p$ is as large as, or larger than, the number of samples $n$. Here, we tackle this problem by improving the Conditional Randomization Test (CRT). The original CRT algorithm is a promising way to output p-values while making few assumptions on the distribution of the test statistics. Since it comes with a prohibitive computational cost even in mildly high-dimensional problems, faster solutions based on distillation have been proposed. Yet, these rely on unrealistic assumptions and result in low-power procedures. To improve on this, we propose \emph{CRT-logit}, an algorithm that combines a variable-distillation step with a decorrelation step accounting for the geometry of the $\ell_1$-penalized logistic regression problem. We provide a theoretical analysis of this procedure and demonstrate its effectiveness in simulations, along with experiments on large-scale brain-imaging and genomics datasets.
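As an illustration of the plain CRT that serves as the starting point here, the sketch below computes a p-value for a single feature $j$ by resampling $X_j$ from an estimated Gaussian conditional law and refitting an $\ell_1$-penalized logistic regression for every resample. The function name crt_pvalue, the Gaussian conditional model, and all parameter choices are illustrative assumptions, not the paper's CRT-logit procedure; the per-resample refit is precisely the computational cost that distillation-based variants, including CRT-logit, aim to remove.

import numpy as np
from sklearn.linear_model import Lasso, LogisticRegression

def crt_pvalue(X, y, j, n_resamples=500, seed=0):
    # Illustrative vanilla CRT for feature j (not the paper's CRT-logit).
    # Assumes X_j | X_{-j} is Gaussian; in the exact CRT this conditional
    # distribution is known, here it is estimated with a lasso regression.
    rng = np.random.default_rng(seed)
    X_minus_j = np.delete(X, j, axis=1)

    cond = Lasso(alpha=0.01).fit(X_minus_j, X[:, j])
    mu = cond.predict(X_minus_j)
    sigma = np.std(X[:, j] - mu)

    def statistic(x_j):
        # Refit the l1-penalized logistic model with the (re)sampled column
        # and use the magnitude of coefficient j as the test statistic.
        X_tilde = X.copy()
        X_tilde[:, j] = x_j
        clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
        clf.fit(X_tilde, y)
        return abs(clf.coef_[0, j])

    t_obs = statistic(X[:, j])
    # Under the null, statistics computed on conditionally resampled copies
    # of X_j are exchangeable with the observed one.
    t_null = np.array([
        statistic(mu + sigma * rng.standard_normal(len(y)))
        for _ in range(n_resamples)
    ])
    return (1 + np.sum(t_null >= t_obs)) / (1 + n_resamples)

Each call to statistic refits the full model, so testing all $p$ features costs $O(p \times n_{\text{resamples}})$ model fits, which is what motivates the faster distillation-based solutions discussed above.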