We propose a method for estimation in high-dimensional linear models with nominal categorical data. Our estimator, called SCOPE, fuses levels together by making their corresponding coefficients exactly equal. This is achieved using the minimax concave penalty on differences between the order statistics of the coefficients for a categorical variable, thereby clustering the coefficients. We provide an algorithm for exact and efficient computation of the global minimum of the resulting nonconvex objective in the case with a single variable with potentially many levels, and use this within a block coordinate descent procedure in the multivariate case. We show that an oracle least squares solution that exploits the unknown level fusions is a limit point of the coordinate descent with high probability, provided the true levels have a certain minimum separation; these conditions are known to be minimal in the univariate case. We demonstrate the favourable performance of SCOPE across a range of real and simulated datasets. An R package CatReg implementing SCOPE for linear models and also a version for logistic regression is available on CRAN.
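To make the form of the penalty concrete, the following minimal Python sketch evaluates a minimax concave penalty (MCP) on the gaps between consecutive order statistics of one categorical variable's coefficients. It illustrates the idea only and is not the CatReg implementation: the function names `mcp` and `scope_penalty`, the parametrisation by `lam` and `gamma`, and the absence of any per-level weighting are assumptions made for this example.

```python
import numpy as np


def mcp(x, lam, gamma):
    """Standard MCP evaluated at |x|.

    Equals lam*|x| - x^2 / (2*gamma) for |x| <= gamma*lam and is
    constant at gamma*lam^2 / 2 beyond that point.
    """
    x = np.abs(np.asarray(x, dtype=float))
    return np.where(
        x <= gamma * lam,
        lam * x - x**2 / (2.0 * gamma),
        0.5 * gamma * lam**2,
    )


def scope_penalty(theta, lam, gamma):
    """Hypothetical helper: sum of MCP over gaps between sorted coefficients.

    theta holds the coefficients of the levels of a single categorical
    variable. Sorting makes the penalty independent of level labelling.
    """
    ordered = np.sort(np.asarray(theta, dtype=float))
    gaps = np.diff(ordered)  # differences between consecutive order statistics
    return float(np.sum(mcp(gaps, lam, gamma)))


if __name__ == "__main__":
    # Two fused levels and one well-separated level: only one nonzero gap,
    # and it lies in the flat region of the MCP.
    print(scope_penalty([0.0, 0.0, 5.0], lam=1.0, gamma=2.0))
    # A small gap is penalised almost linearly, which encourages fusion.
    print(scope_penalty([0.5, 0.0, 0.0], lam=1.0, gamma=2.0))
```

Because the penalty is flat for gaps larger than `gamma * lam`, well-separated clusters of levels incur no additional shrinkage, whereas a zero gap contributes nothing and corresponds to two levels sharing exactly the same coefficient.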