Sparse prediction with categorical data is challenging even for a moderate number of variables, because roughly one parameter is needed to encode each category or level. The Group Lasso is a well-known, efficient algorithm for selecting continuous or categorical variables, but all estimates corresponding to a selected factor usually differ, so the fitted model may not be sparse. To make the Group Lasso solution sparse, we propose to merge levels of a selected factor whenever the difference between their corresponding estimates is smaller than a predetermined threshold. We prove that under weak conditions our algorithm, called GLAMER (Group LAsso MERger), recovers the true sparse linear or logistic model even in the high-dimensional scenario, that is, when the number of parameters exceeds the learning sample size. To our knowledge, although selection consistency has been proven many times for different algorithms fitting sparse models with categorical variables, our result is the first for the high-dimensional scenario. Numerical experiments show the satisfactory performance of GLAMER.
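The merging step can be sketched as follows: given the Group Lasso coefficient estimates for the levels of one selected factor, levels whose estimates are closer than the threshold are fused into a single group. This is an illustrative sketch only (the function name and the single-linkage fusion rule are our assumptions, not the paper's exact procedure):

```python
import numpy as np

def merge_levels(estimates, threshold):
    """Fuse levels of one factor whose sorted coefficient estimates
    differ by less than `threshold` (single-linkage on sorted values).
    Returns a group label for each level and the fused estimates
    (group means). Illustrative sketch, not the exact GLAMER code."""
    estimates = np.asarray(estimates, dtype=float)
    order = np.argsort(estimates)
    sorted_est = estimates[order]
    # start a new group whenever the gap to the previous estimate
    # reaches the threshold
    groups_sorted = np.concatenate(
        ([0], np.cumsum(np.diff(sorted_est) >= threshold)))
    labels = np.empty_like(groups_sorted)
    labels[order] = groups_sorted
    fused = np.array([estimates[labels == g].mean()
                      for g in range(labels.max() + 1)])
    return labels, fused

# Five levels collapse to three effective groups:
labels, fused = merge_levels([0.02, 0.05, 0.9, 1.0, 2.5], threshold=0.2)
print(labels)  # [0 0 1 1 2]
print(fused)   # [0.035 0.95  2.5  ]
```

Fusing levels this way reduces the number of distinct parameters per factor, which is what makes the final fitted model sparse.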