Kernel methods have proven to be a powerful tool for integrating and analysing data generated by high-throughput technologies. Kernels offer a nonlinear version of any linear algorithm that relies solely on dot products. Kernel Principal Component Analysis (KPCA) is therefore a valid nonlinear alternative for handling the nonlinearity of biological sample spaces. This paper proposes a novel methodology to obtain a data-driven feature importance based on the KPCA representation of the data. The proposed method, kernel PCA Interpretable Gradient (KPCA-IG), provides a data-driven feature importance that is computationally fast and based solely on linear algebra calculations. It has been compared with existing methods on three benchmark datasets. The accuracy obtained using the features selected by KPCA-IG is equal to or greater than the average of the other methods. The computational cost it requires also demonstrates the high efficiency of the method. An exhaustive literature search has been conducted on the genes selected from a publicly available hepatocellular carcinoma dataset to validate the retained features from a biological point of view. The results once again underline the appropriateness of the computed ranking. The black-box nature of kernel PCA calls for new methods to interpret the original features. Our proposed methodology, KPCA-IG, proved to be a valid alternative for selecting influential variables in high-dimensional high-throughput datasets, potentially unravelling new biological and medical biomarkers.
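The idea of a gradient-based feature importance on a KPCA representation can be illustrated with a minimal NumPy sketch. It is not the authors' KPCA-IG implementation: the abstract does not specify the kernel, its parameters, or how gradients are aggregated, so the RBF kernel, the default `gamma = 1/d`, the aggregation (mean over samples of the gradient norm across components), and the function names `rbf_kernel` and `kpca_gradient_scores` are all assumptions made for illustration.

```python
import numpy as np


def rbf_kernel(X, gamma):
    """RBF kernel matrix k(x, z) = exp(-gamma * ||x - z||^2) (assumed kernel)."""
    sq = (X ** 2).sum(axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(dist2, 0.0))


def kpca_gradient_scores(X, n_components=2, gamma=None):
    """Hypothetical gradient-based feature scores on a kernel PCA embedding."""
    n, d = X.shape
    gamma = 1.0 / d if gamma is None else gamma  # assumed default bandwidth

    # Kernel PCA: eigendecomposition of the double-centered kernel matrix.
    K = rbf_kernel(X, gamma)
    H = np.eye(n) - np.full((n, n), 1.0 / n)
    eigvals, eigvecs = np.linalg.eigh(H @ K @ H)  # ascending eigenvalues
    top = np.argsort(eigvals)[::-1][:n_components]
    # Scale coefficients so each principal axis has unit norm in feature space.
    A = eigvecs[:, top] / np.sqrt(eigvals[top])

    # Gradient of the k-th projection y_k(x) = sum_i A[i, k] * k~(x, x_i),
    # evaluated at every training point. For the RBF kernel,
    # d k(x, x_i) / dx = -2 * gamma * (x - x_i) * k(x, x_i).
    # The centering terms drop out because each column of A sums to zero.
    grads = np.empty((n, n_components, d))
    for k in range(n_components):
        W = K * A[:, k]  # W[m, i] = K[m, i] * A[i, k]
        grads[:, k, :] = -2.0 * gamma * (W.sum(axis=1, keepdims=True) * X - W @ X)

    # Aggregate: gradient norm across components, averaged over samples.
    return np.sqrt((grads ** 2).sum(axis=1)).mean(axis=0)


# Toy data: feature 0 carries almost all the variance, the rest is tiny noise.
rng = np.random.default_rng(0)
X = rng.normal(scale=0.01, size=(60, 5))
X[:, 0] = rng.normal(scale=1.0, size=60)

scores = kpca_gradient_scores(X)
ranking = np.argsort(scores)[::-1]  # feature indices, most influential first
print(ranking, scores)
```

Everything here reduces to matrix products and one symmetric eigendecomposition, which is in the spirit of the "solely linear algebra" claim, but the specific scoring rule above should be read as one plausible choice rather than the published one.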