Prokaryotic viruses, which infect bacteria and archaea, are key players in microbial communities. Predicting the hosts of prokaryotic viruses helps decipher the dynamic relationships between microbes. Experimental methods for host prediction cannot keep pace with the fast accumulation of sequenced phages, so there is a need for computational host prediction. Despite some promising results, computational host prediction remains a challenge because of the limited number of known interactions and the sheer number of phages sequenced by high-throughput sequencing technologies. The state-of-the-art methods achieve only 43\% accuracy at the species level. In this work, we formulate host prediction as link prediction in a knowledge graph that integrates multiple protein- and DNA-based sequence features. Our implementation, named CHERRY, can be applied to predict hosts for newly discovered viruses and to identify viruses infecting targeted bacteria. We demonstrate the utility of CHERRY for both applications and compare its performance with 11 popular host prediction methods. To the best of our knowledge, CHERRY has the highest accuracy in identifying virus-prokaryote interactions. It outperforms all existing methods at the species level, with an accuracy increase of 37\%. In addition, CHERRY's performance on short contigs is more stable than that of other tools.