Many data analysis problems nowadays call for complexity reduction, chiefly meaning that they aim to remove non-influential covariates from the model and to deliver a sparse model. When categorical covariates are present, with their levels dummy coded, the number of parameters in the model grows rapidly, which emphasizes the need to reduce the number of parameters to be estimated. In this case, beyond variable selection, sparsity is also achieved by fusing levels of a covariate that do not differ significantly in their influence on the response variable. In this work a new regularization technique, called the $L_{0}$-Fused Group Lasso ($L_{0}$-FGL), is introduced for binary logistic regression. It uses a group lasso penalty for factor selection and, for the fusion part, applies an $L_{0}$ penalty to the differences among the level parameters of a categorical predictor. Using adaptive weights, the adaptive version of the $L_{0}$-FGL method is derived. Theoretical properties, such as existence, $\sqrt{n}$ consistency and oracle properties under certain conditions, are established. In addition, it is shown that $\sqrt{n}$ consistency and a variable selection consistency result are attained even in the diverging case, where the number of parameters $p_{n}$ grows with the sample size $n$. Two computational methods, PIRLS and a block coordinate descent (BCD) approach using quasi-Newton steps, are developed and implemented. A simulation study supports that $L_{0}$-FGL performs outstandingly, especially in the high-dimensional case.
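To make the penalty structure concrete, the following is a minimal sketch of how an $L_{0}$-FGL-style penalty value could be evaluated for a set of factor coefficient groups. The function name, the tuning parameters `lam_group` and `lam_fuse`, and the $\sqrt{|g|}$ group weighting are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def l0_fgl_penalty(beta_groups, lam_group=1.0, lam_fuse=0.5, tol=1e-8):
    """Illustrative penalty combining a group lasso term (factor selection)
    with an L0 count of pairwise differences among a factor's level
    coefficients (level fusion). Hypothetical sketch, not the paper's code."""
    penalty = 0.0
    for beta in beta_groups:
        beta = np.asarray(beta, dtype=float)
        # Group lasso part: sqrt(group size) * Euclidean norm of the group.
        # Drives whole factors to zero, i.e. factor selection.
        penalty += lam_group * np.sqrt(beta.size) * np.linalg.norm(beta)
        # L0 fusion part: count level pairs whose coefficients differ.
        # Encourages fusing levels with (numerically) equal effects.
        diffs = beta[:, None] - beta[None, :]
        n_distinct_pairs = int(np.sum(np.abs(np.triu(diffs, k=1)) > tol))
        penalty += lam_fuse * n_distinct_pairs
    return penalty
```

A fully zeroed-out factor contributes nothing, while a factor whose levels share a common coefficient incurs only the group lasso term, never the fusion term; this is the sense in which sparsity arises both from selection and from fusion.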