Motivation: Bacteriophages (phages) are viruses that infect bacteria and archaea, and they play important regulatory roles in natural and host-associated ecosystems. As the most abundant and diverse biological entities in the biosphere, phages have received increasing attention in both research and applications. In particular, identifying their hosts provides key knowledge for their use as antibacterial agents. High-throughput sequencing and its application to the microbiome have opened new opportunities for phage host detection. However, computational host prediction faces two main challenges. First, the number of known phage-host relationships is very small compared to the number of sequenced phages. Second, although sequence similarity between phages and bacteria has been used as a major feature for host prediction, the alignment is either missing or too ambiguous to support an accurate prediction. Thus, there is still a need to improve the accuracy of host prediction.

Results: In this work, we present a semi-supervised learning model, named HostG, for predicting the hosts of novel phages. We construct a knowledge graph from both phage-phage protein similarity and phage-host DNA sequence similarity. A graph convolutional network (GCN) is then trained on this graph, using phages both with and without known hosts to enhance its learning ability. During GCN training, we minimize the expected calibration error (ECE) so that the confidence of each prediction is reliable. We tested HostG on both simulated and real sequencing data, and the results demonstrate that it competes favorably against state-of-the-art pipelines.
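To make the calibration criterion concrete, the following minimal sketch shows how ECE is commonly computed from a set of predictions, assuming equal-width confidence bins. It is not taken from HostG and does not show how HostG incorporates ECE into training. The function name `expected_calibration_error`, its parameters, and the default of 10 bins are illustrative assumptions.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """Common binned ECE estimate (hypothetical helper, not HostG code).

    confidences: predicted probability of the chosen class, in [0, 1].
    correct:     1 if the prediction was right, 0 otherwise.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)

    # Equal-width bin edges over [0, 1]; only interior edges are needed
    # to assign each prediction to one of n_bins bins.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)

    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue  # empty bins contribute nothing
        accuracy = correct[mask].mean()
        avg_confidence = confidences[mask].mean()
        # Weight the accuracy/confidence gap by the fraction of samples in the bin.
        ece += mask.mean() * abs(accuracy - avg_confidence)
    return float(ece)
```

A value of zero means that, within every bin, the average confidence matches the observed accuracy; larger values indicate over- or under-confident predictions.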