Gene-disease associations are fundamental to understanding disease mechanisms and to developing effective interventions and treatments. Identifying genes that have not yet been associated with a disease for lack of studies is a challenging task, in which prioritization based on prior knowledge can be helpful. The computational search for new candidate disease genes may be eased by Positive-Unlabelled (PU) learning, the machine learning (ML) setting in which only a subset of instances is labelled as positive, while the rest of the data set is unlabelled. In this work, we propose a set of effective network-based features to be used in a novel Markov diffusion-based multi-class labelling strategy for putative disease gene discovery. The performance of the new labelling algorithm and the effectiveness of the proposed features were tested on five different disease datasets using three ML algorithms. The proposed features were compared against classical topological and functional/ontological features, and they outperform the classical ones in both binary classification and multi-class labelling. Likewise, the predictive power of the integrated methodology in searching for new disease genes was found to be competitive with state-of-the-art algorithms.
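To make the PU setting and the idea of diffusion-based multi-class labelling concrete, the following is a minimal sketch and not the method proposed in this work. The abstract does not specify the diffusion model, so the sketch assumes a random walk with restart on a gene network as a stand-in Markov diffusion, and it assumes that unlabelled genes are split into equally sized pseudo-classes by rank of their diffusion score. The function names `diffuse_from_seeds` and `pseudo_labels`, the restart probability, and the number of bins are all hypothetical choices made for illustration.

```python
import numpy as np


def diffuse_from_seeds(adjacency, seed_idx, restart=0.3, tol=1e-10, max_iter=1000):
    """Random walk with restart from the known positive (seed) genes.

    ASSUMPTION: this is a generic Markov diffusion used for illustration,
    not the diffusion defined by the authors.
    """
    A = np.asarray(adjacency, dtype=float)
    col_sums = A.sum(axis=0)
    col_sums[col_sums == 0] = 1.0          # avoid division by zero for isolated nodes
    W = A / col_sums                       # column-stochastic transition matrix
    p0 = np.zeros(A.shape[0])
    p0[list(seed_idx)] = 1.0 / len(seed_idx)  # walker restarts uniformly on the seeds
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1.0 - restart) * (W @ p) + restart * p0
        converged = np.abs(p_next - p).sum() < tol
        p = p_next
        if converged:
            break
    return p


def pseudo_labels(scores, seed_idx, n_bins=3):
    """Assign label 0 to seeds and labels 1..n_bins to unlabelled genes.

    ASSUMPTION: unlabelled genes are ranked by descending diffusion score and
    split into n_bins groups of near-equal size; label 1 is closest to the seeds.
    """
    seeds = set(seed_idx)
    labels = np.zeros(len(scores), dtype=int)
    unlabelled = np.array([i for i in range(len(scores)) if i not in seeds], dtype=int)
    ranked = unlabelled[np.argsort(-scores[unlabelled])]
    for label, chunk in enumerate(np.array_split(ranked, n_bins), start=1):
        labels[chunk] = label
    return labels


def path_graph(n):
    """Toy network: genes 0..n-1 connected in a chain (hypothetical data)."""
    A = np.zeros((n, n))
    for i in range(n - 1):
        A[i, i + 1] = A[i + 1, i] = 1.0
    return A


if __name__ == "__main__":
    A = path_graph(6)
    scores = diffuse_from_seeds(A, seed_idx=[0])
    print(pseudo_labels(scores, seed_idx=[0], n_bins=3))
```

In this toy chain, gene 0 is the only known positive, and the remaining genes receive pseudo-classes that reflect how much diffusion mass reaches them from the seed. A downstream classifier could then be trained on these pseudo-classes instead of treating every unlabelled gene as a negative.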