In text classification, overfitting arises from the high dimensionality of the feature space, making regularization essential. Although classic regularizers induce sparsity, they fail to return highly accurate models. In contrast, state-of-the-art group-lasso regularizers achieve better results at the expense of low sparsity. In this paper, we apply a greedy variable selection algorithm, Orthogonal Matching Pursuit (OMP), to the text classification task. We also extend standard group OMP (GOMP) by introducing overlapping group OMP to handle overlapping groups of features. Empirical analysis verifies that both OMP and overlapping GOMP constitute powerful regularizers, able to produce effective and super-sparse models. Code and data are available at: https://www.dropbox.com/sh/7w7hjns71ol0xrz/AAC_G0_0DlcGkq6tQb2zqAaca?dl=0 .
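To make the greedy selection idea concrete, the following is a minimal sketch of plain OMP for a least-squares objective, not the authors' implementation; the function name and signature are illustrative assumptions. At each step it picks the feature most correlated with the current residual, then refits a least-squares model on the selected support (the "orthogonal" step).

```python
import numpy as np

def omp(X, y, n_nonzero):
    """Illustrative Orthogonal Matching Pursuit: greedily select up to
    `n_nonzero` columns of X, refitting least squares on the support
    after each selection."""
    n_features = X.shape[1]
    residual = y.copy()
    support = []
    coef = np.zeros(n_features)
    beta = np.zeros(0)
    for _ in range(n_nonzero):
        # Pick the feature most correlated with the current residual.
        correlations = X.T @ residual
        j = int(np.argmax(np.abs(correlations)))
        if j in support:
            break  # no new feature improves the fit
        support.append(j)
        # Orthogonal step: refit least squares on the selected support.
        beta, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        residual = y - X[:, support] @ beta
    coef[support] = beta
    return coef, support
```

The sparsity level is controlled directly by `n_nonzero`, which is what makes OMP attractive as a regularizer: the model size is an explicit budget rather than a by-product of a penalty weight. The group variants discussed in the paper select whole groups of columns per iteration instead of single columns.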