We propose a method to reduce the complexity of Generalized Linear Models in the presence of categorical predictors. The traditional one-hot encoding, where each category is represented by a dummy variable, can be wasteful, difficult to interpret, and prone to overfitting, especially when dealing with high-cardinality categorical predictors. This paper addresses these challenges by clustering the categories of each categorical predictor to obtain a reduced representation. This is done through a numerical method that aims to preserve (or even improve) accuracy while reducing the number of coefficients to be estimated for the categorical predictors. Thanks to its design, we are able to derive a proximity measure between the categories of a categorical predictor that can be easily visualized. We illustrate the performance of our approach on real-world classification and count-data datasets, where we see that clustering the categorical predictors reduces complexity substantially without harming accuracy.
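The idea can be illustrated with a toy sketch. The grouping below uses a naive stand-in (binning categories by their empirical target rate) rather than the paper's numerical method, and all names and data are hypothetical; it only shows how clustering a high-cardinality predictor's categories shrinks the dummy-coefficient count of a GLM while keeping accuracy comparable:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical predictor: 20 categories that share only 3 latent effects,
# so 20 dummy coefficients are redundant.
n, k = 2000, 20
cats = rng.integers(0, k, size=n)
true_effect = np.array([-1.0, 0.0, 1.0])[cats % 3]
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-true_effect))).astype(int)

# Traditional one-hot encoding: one dummy column per category (k coefficients).
X_full = np.eye(k)[cats]

# Naive clustering proxy: bin the categories into 3 groups by their
# empirical target rate, then one-hot encode the groups (3 coefficients).
rates = np.array([y[cats == c].mean() for c in range(k)])
groups = np.digitize(rates, np.quantile(rates, [1 / 3, 2 / 3]))
X_red = np.eye(3)[groups[cats]]

full = LogisticRegression().fit(X_full, y)
red = LogisticRegression().fit(X_red, y)

print("coefficients:", X_full.shape[1], "vs", X_red.shape[1])
print("accuracy:", round(full.score(X_full, y), 3),
      "vs", round(red.score(X_red, y), 3))
```

Because the 20 categories truly carry only 3 distinct effects, the reduced model fits 3 coefficients instead of 20 with essentially no loss of in-sample accuracy, which is the trade-off the proposed method targets in a principled way.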