Reconstructing microbial genomes from large metagenomic datasets is a critical step in discovering uncultivated microbial populations and defining their functional roles. Doing so requires metagenomic binning, which clusters the assembled contigs into draft genomes. Most existing computational tools, however, neglect one important property of metagenomic data: noise. To improve the binning step and reconstruct better metagenomes, we propose a deep Contrastive Learning framework for Metagenome Binning (CLMB), which efficiently eliminates the disturbance of noise and produces more stable and robust results. Instead of denoising the data explicitly, we add simulated noise to the training data and force the deep learning model to produce similar, stable representations for both the noise-free data and the distorted data. The trained model is consequently robust to noise and handles it implicitly during use. CLMB significantly outperforms the previous state-of-the-art binning methods, recovering the most near-complete genomes on almost all the benchmarking datasets (up to 17% more reconstructed genomes than the second-best method). It also improves the performance of bin refinement, reconstructing 8-22 more high-quality (HQ) genomes and 15-32 more medium-quality genomes than the second-best result. Beyond being compatible with the binning refiner, CLMB alone recovers on average 15 more HQ genomes than the refiner of VAMB and Maxbin on the benchmarking datasets. CLMB is open-source and available at https://github.com/zpf0117b/CLMB/.
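The core training idea, pulling together the representations of a contig and its noise-distorted copy, can be illustrated with a small sketch. This is not the CLMB implementation: the Gaussian noise model, the InfoNCE-style loss, the temperature value, the toy feature vectors, and every function name below are assumptions chosen for illustration, and the encoder is replaced by the identity so the example stays self-contained.

```python
import math
import random


def add_gaussian_noise(features, sigma, rng):
    """Return a distorted copy of one contig's feature vector (assumed noise model)."""
    return [x + rng.gauss(0.0, sigma) for x in features]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    norm_u = math.sqrt(sum(a * a for a in u)) or 1.0
    norm_v = math.sqrt(sum(b * b for b in v)) or 1.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def info_nce_loss(view_a, view_b, temperature=0.5):
    """Average InfoNCE-style loss: view_a[i] should match view_b[i] and no other row."""
    total = 0.0
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        # Numerically stable log-sum-exp over all candidates in the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_denominator - logits[i]
    return total / len(view_a)


rng = random.Random(0)

# Hypothetical, well-separated feature vectors standing in for three contigs.
contigs = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]

# Two independently distorted views of the same batch (identity "encoder").
view_a = [add_gaussian_noise(c, 0.05, rng) for c in contigs]
view_b = [add_gaussian_noise(c, 0.05, rng) for c in contigs]

# Correctly paired views versus deliberately mispaired views.
aligned = info_nce_loss(view_a, view_b)
mismatched = info_nce_loss(view_a, view_b[1:] + view_b[:1])

print(f"aligned loss:    {aligned:.3f}")
print(f"mismatched loss: {mismatched:.3f}")
```

The loss is small when each distorted view is closest to the other view of the same contig and large when the pairs are mismatched. Minimizing such a loss through a trainable encoder is what pushes the learned representations to stay stable under noise.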