We develop scalable randomized kernel methods that jointly associate data from multiple sources and simultaneously predict an outcome or classify a unit into one of two or more classes. The proposed methods model nonlinear relationships in multiview data while predicting a clinical outcome, and they can identify the variables or groups of variables that contribute most to the relationships among the views. Using the fact that random Fourier bases can approximate shift-invariant kernel functions, we construct nonlinear mappings of each view, and we use these mappings together with the outcome variable to learn view-independent low-dimensional representations. Through simulation studies, we show that the proposed methods outperform several other linear and nonlinear methods for multiview data integration. When we applied the proposed methods to gene expression, metabolomics, proteomics, and lipidomics data pertaining to COVID-19, we identified several molecular signatures for COVID-19 status and severity. Results from our real data application and from simulations with small sample sizes suggest that the proposed methods may be useful for small sample size problems. Availability: Our algorithms are implemented in PyTorch and interfaced in R, and will be made available at: https://github.com/lasandrall/RandMVLearn.
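To make the random Fourier idea concrete, the following is a minimal sketch of how random Fourier features can approximate a shift-invariant kernel for a single view. It is not taken from the RandMVLearn code: the Gaussian kernel, the NumPy implementation, and the names `random_fourier_features`, `gaussian_kernel`, `n_features`, and `gamma` are all assumptions chosen for illustration.

```python
import numpy as np


def random_fourier_features(X, n_features=2000, gamma=0.5, seed=0):
    """Map X (n_samples, n_dims) to random features whose inner products
    approximate the Gaussian kernel exp(-gamma * ||x - y||^2).

    Hypothetical helper for illustration; not the authors' implementation.
    """
    rng = np.random.default_rng(seed)
    n_dims = X.shape[1]
    # Frequencies sampled from the kernel's spectral density, N(0, 2*gamma*I).
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(n_dims, n_features))
    # Random phases, uniform on [0, 2*pi).
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)


def gaussian_kernel(X, gamma=0.5):
    """Exact Gaussian kernel matrix, used only to check the approximation."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


# Synthetic stand-in for one data view: 20 samples, 5 variables.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5))

Z = random_fourier_features(X)  # nonlinear mapping of the view
max_abs_error = np.max(np.abs(Z @ Z.T - gaussian_kernel(X)))
```

Here `Z @ Z.T` stands in for the full kernel matrix, so downstream computations can work with an `n_samples x n_features` matrix instead of an `n_samples x n_samples` kernel. Increasing `n_features` tightens the approximation at the cost of more computation.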