Categorical data are present in key areas such as health or supply chain, and this data require specific treatment. In order to apply recent machine learning models on such data, encoding is needed. In order to build interpretable models, one-hot encoding is still a very good solution, but such encoding creates sparse data. Gradient estimators are not suited for sparse data: the gradient is mainly considered as zero while it simply does not always exists, thus a novel gradient estimator is introduced. We show what this estimator minimizes in theory and show its efficiency on different datasets with multiple model architectures. This new estimator performs better than common estimators under similar settings. A real world retail dataset is also released after anonymization. Overall, the aim of this paper is to thoroughly consider categorical data and adapt models and optimizers to these key features.
翻译:在健康或供应链等关键领域存在分类数据,而这些数据需要特定的处理。为了在这些数据上应用最近的机器学习模型,需要编码。为了建立可解释的模型,单热编码仍是一个非常好的解决方案,但这种编码会创造稀有的数据。渐变估计器不适合稀少的数据:梯度主要被视为零,而它根本不总是存在的,因此引入了一个新的梯度估计器。我们展示了这个估计器在理论中如何在多模型结构的不同数据集中最小化,并展示了它的效率。这个新的估计器比类似环境中的通用估计器表现得更好。一个真实的世界零售数据集在匿名后也被释放。总的来说,本文件的目的是彻底考虑绝对数据,并对这些关键特征调整模型和优化模型。