Sparse modelling or model selection with categorical data is challenging even for a moderate number of variables, because roughly one parameter is needed to encode each category or level. The Group Lasso is a well-known, efficient algorithm for selecting continuous or categorical variables, but the estimates for the levels of a selected factor usually all differ. Therefore a fitted model may not be sparse, which makes it difficult to interpret. To obtain a sparse solution of the Group Lasso, we propose the following two-step procedure: first, we reduce the data dimensionality using the Group Lasso; then, to choose the final model, we apply an information criterion to a small family of models prepared by clustering the levels of individual factors. We investigate the selection correctness of the algorithm in a sparse high-dimensional scenario. We also test our method on synthetic as well as real datasets and show that it outperforms state-of-the-art algorithms with respect to prediction accuracy or model dimension.
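The two-step procedure described above can be sketched in code. The following is a minimal illustration, not the authors' implementation: it assumes dummy-encoded factors, a squared-error loss, a plain proximal-gradient solver for the Group Lasso, and a simple largest-gap rule for clustering the within-factor coefficient estimates (the paper's actual clustering and information criterion may differ). All function names here are hypothetical.

```python
import numpy as np

def group_lasso(X, y, groups, lam, n_iter=500, lr=None):
    """Step 1 (dimension reduction): Group Lasso fit by proximal
    gradient descent. `groups` is a list of column-index arrays,
    one per factor (each factor's dummy columns form one group)."""
    n, p = X.shape
    if lr is None:
        # Step size 1/L, where L is the Lipschitz constant of the gradient.
        lr = 1.0 / np.linalg.norm(X, 2) ** 2
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n
        z = beta - lr * grad
        for g in groups:
            # Group-wise soft thresholding: shrinks a whole factor's
            # coefficient block, possibly to exactly zero.
            norm = np.linalg.norm(z[g])
            scale = max(0.0, 1.0 - lr * lam / norm) if norm > 0 else 0.0
            beta[g] = scale * z[g]
    return beta

def fuse_levels(coefs, n_clusters):
    """Step 2 (one candidate model): cluster a selected factor's level
    estimates into `n_clusters` groups by cutting the largest gaps in
    sorted order; levels sharing a label get a common fused effect.
    Varying `n_clusters` yields the small family of models to be
    compared by an information criterion such as BIC."""
    order = np.argsort(coefs)
    gaps = np.diff(coefs[order])
    cuts = np.sort(np.argsort(gaps)[::-1][: n_clusters - 1])
    labels = np.zeros(len(coefs), dtype=int)
    lab = 0
    for i, idx in enumerate(order):
        if i > 0 and (i - 1) in cuts:
            lab += 1
        labels[idx] = lab
    return labels
```

For example, level estimates `[0.1, 0.12, 1.0, 1.05]` fused into two clusters merge the first two and the last two levels, so the fitted factor effectively has two effects instead of four, which is the sparsity the procedure is after.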