Designing categorical kernels is a major challenge for Gaussian process regression with continuous and categorical inputs. Despite previous studies, it is difficult to identify a preferred method because the evaluation metrics, the optimization procedure, or the datasets differ from one study to another. In particular, reproducible code is rarely available. The aim of this paper is to provide a reproducible comparative study of all existing categorical kernels on many of the test cases investigated so far. We also propose new evaluation metrics inspired by the optimization community, which provide quantitative rankings of the methods across several tasks. On datasets whose categorical inputs exhibit a group structure among their levels, our results indicate that nested kernel methods clearly outperform all competitors. When the group structure is unknown, or when no prior knowledge of such a structure is available, we propose a new clustering-based strategy that uses target encodings of the categorical variables to estimate it. We show that on a large panel of datasets, which do not necessarily have a known group structure, this estimation strategy still outperforms other approaches while maintaining a low computational cost.
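To make the clustering-based idea concrete, the following is a minimal sketch of how the levels of a categorical input might be grouped from target encodings. It is not the procedure used in the paper, whose details the abstract does not give. The choice of encoding (per-level mean and standard deviation of the response), the use of plain k-means, the number of groups, and all function and variable names are assumptions made for illustration only.

```python
import numpy as np

def target_encode(levels, y):
    """Encode each level by the mean and std of the response observed at it.

    Assumption: a (mean, std) encoding; other target encodings are possible.
    """
    uniq = np.unique(levels)  # sorted unique levels
    feats = np.array([[y[levels == u].mean(), y[levels == u].std()]
                      for u in uniq])
    return uniq, feats

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per row of X."""
    rng = np.random.default_rng(seed)
    # Initialise centers on k distinct rows (fancy indexing returns a copy).
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Squared distance from every row to every center.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # leave empty clusters where they are
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Toy data (hypothetical): six levels, two observations each.
# Levels a, b, c have responses near 0; levels d, e, f near 10.
levels = np.array(list("aabbccddeeff"))
y = np.array([-0.5, 0.5, -0.4, 0.6, -0.3, 0.7,
              9.5, 10.5, 9.6, 10.6, 9.7, 10.7])

uniq, feats = target_encode(levels, y)
groups = kmeans(feats, k=2)          # estimated group index per level
level_to_group = dict(zip(uniq.tolist(), groups.tolist()))
```

In a workflow of this kind, the estimated `level_to_group` mapping would stand in for the unknown group structure when building a nested kernel. How the number of groups is selected is left open here.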