Many learning algorithms require categorical data to be transformed into real vectors before they can be used as input. Often, categorical variables are encoded as one-hot (or dummy) vectors. However, this representation can be wasteful, since it adds many low-signal regressors, especially when the number of unique categories is large. In this paper, we investigate simple alternative constructions of universally consistent estimators that rely on lower-dimensional real-valued representations of categorical variables that are "sufficient" in the sense that no predictive information is lost. We then compare preexisting and proposed methods on simulated and observational datasets.