We have gained access to vast amounts of multi-omics data thanks to Next Generation Sequencing. However, analysing these data is challenging due to their high dimensionality and the fact that much of them are not annotated. Lack of annotated data is a significant problem in machine learning, and Self-Supervised Learning (SSL) methods are typically used to deal with limited labelled data. However, there is a lack of studies that use SSL methods to exploit inter-omics relationships in unlabelled multi-omics data. In this work, we develop a novel and efficient pre-training paradigm that consists of various SSL components, including but not limited to contrastive alignment, data recovery from corrupted samples, and using one omic type to recover other omic types. Our pre-training paradigm improves performance on downstream tasks with limited labelled data. We show that our approach outperforms the state-of-the-art method in cancer type classification on the TCGA pan-cancer dataset in a semi-supervised setting. Moreover, we show that the encoders pre-trained using our approach can be used as powerful feature extractors even without fine-tuning. Our ablation study shows that the method is not overly dependent on any single pretext task component. The network architectures in our approach are designed to handle missing omic types and multiple datasets for pre-training and downstream training. Our pre-training paradigm can be extended to perform zero-shot classification of rare cancers.
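To make the pre-training components concrete, the following is a minimal sketch, not the authors' implementation: it assumes two omic types represented as fixed-length vectors, hypothetical encoder/decoder modules, a masking-based corruption scheme, and an InfoNCE-style loss for the contrastive alignment between omics; the cross-omics head illustrates recovering one omic type from another.

```python
# Illustrative sketch only (hypothetical module names and loss weighting),
# assuming two omic types, e.g. gene expression and DNA methylation vectors.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OmicsEncoder(nn.Module):
    """Simple MLP encoder mapping one omic profile to a shared embedding space."""
    def __init__(self, in_dim, emb_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(),
                                 nn.Linear(512, emb_dim))

    def forward(self, x):
        return self.net(x)

def contrastive_alignment(z1, z2, tau=0.1):
    """InfoNCE-style loss aligning the two omic embeddings of the same sample."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau                         # (B, B) similarity matrix
    labels = torch.arange(z1.size(0), device=z1.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

def mask_corrupt(x, p=0.3):
    """Randomly zero out a fraction of features (corrupted-sample pretext task)."""
    return x * (torch.rand_like(x) > p).float()

def pretrain_step(x_expr, x_meth, enc_e, enc_m, dec_e, dec_m, cross_e2m):
    """One combined SSL step: alignment + recovery from corruption + cross-omics recovery."""
    z_e = enc_e(mask_corrupt(x_expr))
    z_m = enc_m(mask_corrupt(x_meth))
    loss_align = contrastive_alignment(z_e, z_m)        # inter-omics alignment
    loss_recon = (F.mse_loss(dec_e(z_e), x_expr) +      # recover original from corrupted input
                  F.mse_loss(dec_m(z_m), x_meth))
    loss_cross = F.mse_loss(cross_e2m(z_e), x_meth)     # one omic type recovers another
    return loss_align + loss_recon + loss_cross
```

After such pre-training, the encoders (enc_e, enc_m in the sketch) could be frozen and used as feature extractors, or fine-tuned on the labelled downstream task; the specific architectures, corruption scheme, and loss weighting in the paper may differ from this sketch.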