Improvement of variables interpretability in kernel PCA

Kernel methods have proven to be a powerful tool for integrating and analysing data generated by high-throughput technologies. Kernels yield a nonlinear version of any linear algorithm that relies solely on dot products. The kernelized version of Principal Component Analysis (kernel PCA, KPCA) is a valid nonlinear alternative for tackling the nonlinearity of biological sample spaces. This paper proposes a novel methodology to obtain a data-driven feature importance based on the KPCA representation of the data. The proposed method, kernel PCA Interpretable Gradient (KPCA-IG), provides a data-driven feature importance that is computationally fast and based solely on linear algebra calculations. It has been compared with existing methods on three benchmark datasets. The accuracy obtained using the features selected by KPCA-IG is equal to or greater than the average of the other methods. The computational complexity required also demonstrates the high efficiency of the method. An exhaustive literature search has been conducted on the genes selected from a publicly available hepatocellular carcinoma dataset to validate the retained features from a biological point of view. The results again confirm the appropriateness of the computed ranking. The black-box nature of kernel PCA calls for new methods to interpret the original features. The proposed methodology, KPCA-IG, proved to be a valid alternative for selecting influential variables in high-dimensional high-throughput datasets, potentially unravelling new biological and medical biomarkers.
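The abstract gives no formulas, so the following is only a minimal sketch of the general idea: run kernel PCA with nothing but kernel (dot-product) evaluations, then score each original variable by how strongly the projected coordinates change when that variable changes. The RBF kernel, the value of `gamma`, the function names and the particular score (mean gradient norm over the sample points) are all assumptions made for illustration and should not be read as the paper's definition of KPCA-IG.

```python
import numpy as np

def rbf_kernel(X, gamma):
    # K[i, j] = exp(-gamma * ||x_i - x_j||^2)  (kernel choice is an assumption)
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def kernel_pca(K, n_components):
    # Centre the kernel matrix in feature space, then eigendecompose it.
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kc = H @ K @ H
    eigvals, eigvecs = np.linalg.eigh(Kc)
    order = np.argsort(eigvals)[::-1][:n_components]
    lam, V = eigvals[order], eigvecs[:, order]
    alpha = V / np.sqrt(lam)   # expansion coefficients of each component
    scores = Kc @ alpha        # projections of the training points
    return scores, alpha

def gradient_importance(X, K, alpha, gamma):
    # Hypothetical score: norm of the gradient of the projected coordinates
    # with respect to each input variable, averaged over the sample points.
    # For the RBF kernel, d k(x, x_i) / dx = -2 * gamma * (x - x_i) * k(x, x_i).
    beta = alpha - alpha.mean(axis=0)      # accounts for kernel centring
    total = np.zeros_like(X, dtype=float)
    for k in range(beta.shape[1]):
        W = K * beta[:, k]                 # W[m, i] = K[m, i] * beta[i, k]
        grad = -2.0 * gamma * (W.sum(axis=1, keepdims=True) * X - W @ X)
        total += grad ** 2
    return np.sqrt(total).mean(axis=0)

# Toy data: variables 0 and 1 carry the structure, variables 2-4 are tiny noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5)) * np.array([1.0, 1.0, 0.05, 0.05, 0.05])
gamma = 0.5
K = rbf_kernel(X, gamma)
scores, alpha = kernel_pca(K, n_components=2)
importance = gradient_importance(X, K, alpha, gamma)
ranking = np.argsort(importance)[::-1]     # most influential variables first
```

Everything here reduces to matrix products and one symmetric eigendecomposition, which is consistent with the abstract's remark that the approach relies only on linear algebra.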


Related content

PCA


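In statistics, principal component analysis (PCA) is a method that projects data from a higher-dimensional space into a lower-dimensional space by maximizing the variance along each dimension. Given a collection of points in two-, three-, or higher-dimensional space, a "best-fitting" line can be defined as the line that minimizes the average squared distance from the points to the line. The next best-fitting line can be chosen in the same way from the directions perpendicular to the first. Repeating this process yields an orthogonal basis in which the individual dimensions of the data are uncorrelated. These basis vectors are called principal components.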
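As a small illustration of the description above, the sketch below computes principal components with a singular value decomposition of the centred data. The function name `pca` and the toy data are assumptions chosen for the example; the point is only to show that the resulting directions are orthonormal and that the projected coordinates are uncorrelated.

```python
import numpy as np

def pca(X, n_components):
    # Centre the data, then take the leading right singular vectors
    # as the principal directions (rows of `components`).
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    scores = Xc @ components.T                       # projected coordinates
    explained_variance = S[:n_components] ** 2 / (X.shape[0] - 1)
    return components, scores, explained_variance

# Toy data with correlated columns (illustrative only).
rng = np.random.default_rng(0)
mixing = np.array([[3.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.0, 0.0, 0.2]])
X = rng.normal(size=(200, 3)) @ mixing

components, scores, explained_variance = pca(X, n_components=2)
score_cov = np.cov(scores, rowvar=False)             # close to a diagonal matrix
```

The diagonal of `score_cov` matches `explained_variance`, and its off-diagonal entries are zero up to rounding, which is the "uncorrelated dimensions" property stated in the text.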
