Categorical data are present in key areas such as health and supply chain, and such data require specific treatment. To apply recent machine learning models to categorical data, an encoding is needed. For building interpretable models, one-hot encoding remains a very good solution, but it produces sparse data. Common gradient estimators are not suited to sparse data: the gradient is mostly treated as zero where, in fact, it does not always exist. We therefore introduce a novel gradient estimator. We show what this estimator minimizes in theory and demonstrate its efficiency on different datasets with multiple model architectures. The new estimator performs better than common estimators under similar settings. A real-world retail dataset is also released after anonymization. Overall, the aim of this paper is to consider categorical data thoroughly and to adapt models and optimizers to their key features.
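
The sparsity issue described above can be illustrated with a minimal sketch. It is not the estimator proposed in this paper: the toy data, the linear model, and the squared loss are assumptions chosen only for illustration. It shows that after one-hot encoding, a standard gradient reports exactly zero for a category absent from the batch, although the batch carries no information about that category.

```python
import numpy as np

# Hypothetical toy data: one categorical feature with 4 levels, 6 samples.
# Level 3 never occurs in this batch.
categories = np.array([0, 2, 2, 1, 0, 2])
n_levels = 4
X = np.eye(n_levels)[categories]  # one-hot encoding, shape (6, 4)
y = np.array([1.0, 3.0, 2.5, 0.5, 1.5, 3.5])

# Fraction of zero entries in the encoded matrix.
sparsity = 1.0 - X.sum() / X.size

# Assumed model: linear, with loss L(w) = mean((X w - y)^2) / 2.
w = np.zeros(n_levels)
residual = X @ w - y
grad = X.T @ residual / len(y)

print("sparsity:", sparsity)
print("gradient per level:", grad)
# The entry for level 3 is zero because the level is unobserved,
# not because the loss is flat in that direction.
```

In this sketch, three quarters of the encoded entries are zero, and the gradient component for the unobserved level is zero purely as an artifact of the encoding.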