Robust variable selection in the framework of classification with label noise and outliers: applications to spectroscopic data in agri-food

Classification of high-dimensional spectroscopic data is a common task in analytical chemistry. Well-established procedures such as support vector machines (SVMs) and partial least squares discriminant analysis (PLS-DA) are the most common methods for tackling this supervised learning problem. Nonetheless, these models are sometimes difficult to interpret, and solutions based on feature selection are often adopted because they automatically identify the most informative wavelengths. Unfortunately, in delicate applications such as food authenticity, mislabeled and adulterated spectra can occur in the calibration set, the validation set, or both, with dramatic effects on model development, prediction accuracy, and robustness. Motivated by these issues, the present paper proposes a robust model-based method that simultaneously performs variable selection, outlier detection, and label noise detection. We demonstrate the effectiveness of our proposal on three agri-food spectroscopic studies in which several forms of perturbation are considered. Our approach succeeds in reducing problem complexity, identifying anomalous spectra, and attaining competitive predictive accuracy with a very small number of selected wavelengths.
