Gene-disease associations are fundamental for understanding disease etiology and for developing effective interventions and treatments. Identifying genes that have not yet been associated with a disease because of a lack of studies is a challenging task, in which prioritization based on prior knowledge plays an important role. The computational search for new candidate disease genes can be eased by positive-unlabeled learning, the machine learning setting in which only a subset of instances is labeled as positive while the rest of the data set is unlabeled. In this work, we propose a set of effective network-based features to be used in a novel Markov diffusion-based multi-class labeling strategy for putative disease gene discovery. The performance of the new labeling algorithm and the effectiveness of the proposed features were tested on ten different disease data sets using three machine learning algorithms. The new features were compared against classical topological and functional/ontological features, and against a set of network-derived and biologically derived features already used in gene discovery tasks. The predictive power of the integrated methodology in searching for new disease genes was found to be competitive with state-of-the-art algorithms.
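To make the positive-unlabeled setting and the idea of diffusion-based labeling concrete, the following is a minimal sketch and not the algorithm proposed in this work. It assumes a random walk with restart as the Markov diffusion process, a toy six-gene path network, a single known disease gene as seed, and an equal-sized rank binning of unlabeled genes into three pseudo-classes. The function names, the restart probability, and the number of classes are all illustrative assumptions.

```python
import numpy as np


def random_walk_with_restart(adjacency, seeds, restart=0.3, tol=1e-12, max_iter=10_000):
    """Diffuse probability mass from the seed (known positive) nodes over a graph."""
    adjacency = np.asarray(adjacency, dtype=float)
    degree = adjacency.sum(axis=0)
    degree[degree == 0] = 1.0  # avoid division by zero for isolated nodes
    transition = adjacency / degree  # column-stochastic: column j is divided by degree[j]
    p0 = np.zeros(adjacency.shape[0])
    p0[list(seeds)] = 1.0 / len(seeds)  # the walker restarts uniformly on the seeds
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1.0 - restart) * (transition @ p) + restart * p0
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p


def assign_pseudo_labels(scores, seeds, n_classes=3):
    """Label seeds as 0 and split unlabeled nodes into rank-based classes 1..n_classes.

    Class 1 holds the unlabeled nodes with the highest diffusion scores
    (closest to the seeds); class n_classes holds the lowest-scoring ones.
    """
    labels = np.zeros(len(scores), dtype=int)
    seed_set = set(seeds)
    unlabeled = np.array([i for i in range(len(scores)) if i not in seed_set], dtype=int)
    ranked = unlabeled[np.argsort(-scores[unlabeled], kind="stable")]
    for class_id, chunk in enumerate(np.array_split(ranked, n_classes), start=1):
        labels[chunk] = class_id
    return labels


# Toy network (assumption): six genes connected in a path 0-1-2-3-4-5.
n_genes = 6
adjacency = np.zeros((n_genes, n_genes))
for i in range(n_genes - 1):
    adjacency[i, i + 1] = adjacency[i + 1, i] = 1.0

seeds = [0]  # gene 0 is the only known disease gene in this toy example
scores = random_walk_with_restart(adjacency, seeds)
labels = assign_pseudo_labels(scores, seeds)
print(labels)
```

In this sketch, genes closer to the seed in the network receive higher diffusion scores and therefore land in lower-numbered pseudo-classes. The resulting multi-class labels could then be passed, together with any feature set, to a standard classifier.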