The problem of identifying the most discriminating features when performing supervised learning has been extensively investigated. In particular, several methods for variable selection in model-based classification have been proposed. Surprisingly, the impact of outliers and wrongly labeled units on the determination of relevant predictors has received far less attention, with almost no dedicated methodologies available in the literature. In the present paper, we introduce two robust variable selection approaches: one that embeds a robust classifier within a greedy forward selection procedure, and another based on the theory of maximum likelihood estimation and irrelevance. The former recasts feature identification as a model selection problem, while the latter regards the relevant subset as a model parameter to be estimated. The benefits of the proposed methods, in contrast with non-robust solutions, are assessed via an experiment on synthetic data. An application to a high-dimensional classification problem of contaminated spectroscopic data concludes the paper.