In genome biology research, regulatory genome modeling underlies many downstream regulatory tasks, such as promoter classification and transcription factor binding site prediction. The core problem is to model how regulatory elements interact with each other and how those interactions vary across cell types. However, current deep learning methods often model genome sequences for a fixed set of cell types and do not account for interactions between multiple regulatory elements. As a result, they perform well only on the cell types seen in training and lack the generalizability required in biological applications. In this work, we propose a simple yet effective approach for pre-training on genome data in a multi-modal and self-supervised manner, which we call GeneBERT. Specifically, we simultaneously take the 1D sequence of genome data and a 2D matrix of (transcription factors × regions) as input, and we propose three pre-training tasks to improve the robustness and generalizability of our model. We pre-train our model on the ATAC-seq dataset with 17 million genome sequences. We evaluate GeneBERT on regulatory downstream tasks across different cell types, including promoter classification, transcription factor binding site prediction, disease risk estimation, and splicing site prediction. Extensive experiments demonstrate the effectiveness of multi-modal and self-supervised pre-training for large-scale regulatory genomics data.
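To make the two input modalities concrete, the following is a minimal sketch of how a 1D sequence view and a 2D (transcription factors × regions) matrix might be assembled for a few toy regions. The abstract does not specify the tokenization or how the matrix is built, so everything here is an assumption for illustration: the overlapping k-mer tokenizer, the exact-substring motif match, the helper names `kmer_tokens` and `tf_region_matrix`, and the motifs `TF_A` and `TF_B` are all hypothetical and are not taken from GeneBERT itself.

```python
def kmer_tokens(seq, k=3):
    """Split a DNA string into overlapping k-mers (an assumed tokenization)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def tf_region_matrix(regions, motifs):
    """Build a binary (n_tfs x n_regions) matrix as a list of rows.

    Entry [i][j] is 1 if the motif string of transcription factor i occurs
    verbatim in region j, else 0. Exact substring matching is a
    simplification chosen for this sketch, not the paper's procedure.
    """
    return [
        [1 if motif in region else 0 for region in regions]
        for motif in motifs.values()
    ]


# Toy data: three short regions and two hypothetical motifs.
regions = ["ACGTGACGTCA", "TTGACGTCAAT", "GGGCGGGGCCA"]
motifs = {"TF_A": "TGACGTCA", "TF_B": "GGGCGG"}

# 1D modality: one token list per region.
tokens = [kmer_tokens(region) for region in regions]

# 2D modality: rows are transcription factors, columns are regions.
matrix = tf_region_matrix(regions, motifs)
```

In this sketch, each region yields one token list for a sequence encoder, while `matrix` gives every transcription factor a row over the same set of regions, so the two views stay aligned by region index.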