Selecting the essential variables in logistic regression is vital because the model is used extensively in medical studies, finance, economics, and related fields. In this paper, we explore four main categories of frequentist variable selection methods in the logistic regression setup: test-based, penalty-based, screening-based, and tree-based. The primary objective of this work is to give practitioners a comprehensive overview of the existing literature. We also detail the underlying assumptions and theory of the methods, along with the specifics of their implementations. Next, we conduct a thorough simulation study to compare the performance of fifteen different methods in terms of variable selection, coefficient estimation, prediction accuracy, and time complexity under various settings. We consider low-, moderate-, and high-dimensional setups with different correlation structures for the covariates. A real-life application using a high-dimensional gene expression dataset is also included to further assess the efficacy and consistency of the methods. Finally, based on our findings from the simulated and real data, we provide recommendations for practitioners on the choice of variable selection methods in various contexts.