CLCLSA: Cross-omics Linked embedding with Contrastive Learning and Self Attention for multi-omics integration with incomplete multi-omics data

Integration of heterogeneous and high-dimensional multi-omics data is becoming increasingly important in understanding genetic data. Each omics technique provides only a limited view of the underlying biological process, and integrating heterogeneous omics layers simultaneously can lead to a more comprehensive and detailed understanding of diseases and phenotypes. However, one obstacle in multi-omics data integration is the existence of unpaired multi-omics data due to instrument sensitivity and cost. Analyses may fail when some omics measurements for a subject are missing or incomplete. In this paper, we propose a deep learning method for multi-omics integration with incomplete data: Cross-omics Linked unified embedding with Contrastive Learning and Self Attention (CLCLSA). Using complete multi-omics data as supervision, the model employs cross-omics autoencoders to learn feature representations across different types of biological data. Before the latent features are concatenated, multi-omics contrastive learning is applied to maximize the mutual information between different omics types. In addition, feature-level self-attention and omics-level self-attention are employed to dynamically identify the most informative features for multi-omics data integration. Extensive experiments were conducted on four public multi-omics datasets. The results indicate that the proposed CLCLSA outperforms state-of-the-art approaches for multi-omics data classification using incomplete multi-omics data.

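The abstract states that contrastive learning is used to maximize the mutual information between omics types, but it does not give the loss itself. As a rough illustration only, the sketch below uses a symmetric InfoNCE objective, a common contrastive loss whose minimization raises a lower bound on mutual information. The choice of InfoNCE, the function name `info_nce`, the temperature value, and the toy embeddings `z_mrna` and `z_meth` are all assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def info_nce(z_a, z_b, temperature=0.5):
    """Symmetric InfoNCE loss between two (n_samples, dim) embedding matrices.

    Row i of z_a and row i of z_b are treated as a positive pair (the same
    subject seen through two omics types); all other rows act as negatives.
    """
    # L2-normalize so the dot product becomes a cosine similarity.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature  # shape (n_samples, n_samples)

    def diagonal_cross_entropy(scores):
        # Numerically stable log-softmax over each row; target is the diagonal.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the a->b and b->a directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))


# Toy data: hypothetical latent features for two omics types of 32 subjects.
rng = np.random.default_rng(0)
z_mrna = rng.normal(size=(32, 16))
z_meth = z_mrna + 0.1 * rng.normal(size=(32, 16))  # paired with z_mrna

loss_paired = info_nce(z_mrna, z_meth)
# Shift rows by one so that no subject is matched with itself.
loss_mismatched = info_nce(z_mrna, np.roll(z_meth, 1, axis=0))
print(loss_paired, loss_mismatched)
```

In this toy setup the correctly paired embeddings yield a lower loss than the shifted ones, which is the behavior a contrastive objective is meant to reward. In a real model the embeddings would come from the omics-specific encoders and the loss would be minimized by gradient descent in a deep learning framework rather than evaluated in NumPy.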