Datasets containing both categorical and continuous variables arise in many areas, and with the rapid development of modern measurement technologies, the dimensions of these variables can be very high. Despite recent progress in modelling high-dimensional continuous data, few methods can handle a mixed set of variables. To fill this gap, this paper develops a novel approach for classifying high-dimensional observations with mixed variables. Our framework builds on a location model, in which the distributions of the continuous variables conditional on the categorical ones are assumed to be Gaussian. We overcome the challenge of having to split the data into exponentially many cells, that is, combinations of the categorical variables, by kernel smoothing, and we provide a new perspective on the bandwidth choice: it is selected to ensure an analogue of Bochner's Lemma, a criterion that differs from the usual bias-variance tradeoff. We show that the two sets of parameters in our model can be estimated separately and propose penalized likelihood methods for their estimation. Results on the estimation accuracy and the misclassification rates are established, and the competitive performance of the proposed classifier is illustrated by extensive simulation and real data studies.
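To make the idea of smoothing across categorical cells concrete, the following is a minimal sketch and not the estimator developed in the paper. It assumes binary categorical variables, a kernel weight of the form `bandwidth ** d(u, v)` where `d` counts mismatched categorical entries, a covariance matrix shared by all cells and classes, and a small ridge term in place of the penalized likelihood step. All function names and parameters (`hamming_kernel_weights`, `smoothed_cell_mean`, `classify`, `bandwidth`, `ridge`) are hypothetical.

```python
import numpy as np


def hamming_kernel_weights(U, u, bandwidth):
    """Weight each observation by bandwidth ** (number of categorical mismatches with u)."""
    if not 0.0 < bandwidth <= 1.0:
        raise ValueError("bandwidth must lie in (0, 1]")
    mismatches = (U != u).sum(axis=1)
    return bandwidth ** mismatches


def smoothed_cell_mean(X, U, u, bandwidth):
    """Kernel-smoothed mean of the continuous variables for the cell u.

    Observations from nearby cells contribute with smaller weight, so a
    sparsely populated (or empty) cell still receives an estimate.
    """
    w = hamming_kernel_weights(U, u, bandwidth)
    return (w[:, None] * X).sum(axis=0) / w.sum()


def classify(x, u, classes, bandwidth, ridge=1e-3):
    """Assign (x, u) to the class with the largest Gaussian discriminant score.

    classes is a list of (X_k, U_k) pairs, one per class.
    Assumption: one covariance matrix shared by all cells and classes.
    """
    # Residuals of every observation around the smoothed mean of its own cell.
    residuals = np.vstack([
        X_k - np.array([smoothed_cell_mean(X_k, U_k, v, bandwidth) for v in U_k])
        for X_k, U_k in classes
    ])
    p = residuals.shape[1]
    # Ridge term keeps the covariance invertible (a stand-in for penalization).
    cov = residuals.T @ residuals / residuals.shape[0] + ridge * np.eye(p)
    precision = np.linalg.inv(cov)

    n_total = sum(len(X_k) for X_k, _ in classes)
    scores = []
    for X_k, U_k in classes:
        w = hamming_kernel_weights(U_k, u, bandwidth)
        mu = (w[:, None] * X_k).sum(axis=0) / w.sum()
        # Smoothed estimate of P(class k, cell u), up to a constant common to all classes.
        log_prior = np.log(w.sum() / n_total)
        diff = x - mu
        scores.append(log_prior - 0.5 * diff @ precision @ diff)
    return int(np.argmax(scores))
```

In this sketch, `bandwidth = 1` pools all cells into a single mean, while values near 0 approach the raw per-cell means; the abstract's point is that the choice between these extremes is governed by a different criterion from the usual bias-variance tradeoff.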