The availability of genomic data has grown exponentially over the last decade, mainly due to the development of new sequencing technologies. Drawing on the interactions between genes (and gene products) extracted from this growing body of data, numerous studies have focused on identifying associations between genes and functions. While these studies have shown great promise, annotating genes with functions remains an open challenge. In this work, we present a method to detect missing annotations in hierarchical multi-label classification datasets. The method exploits the class hierarchy by computing, for each instance, aggregated probabilities for the paths of classes from the leaves to the root. We present it in the context of predicting missing gene function annotations, where these aggregated probabilities are further used to select a set of annotations to be verified through in vivo experiments. Experiments on Oryza sativa Japonica, a variety of rice, show that incorporating the class hierarchy into the method often improves predictive performance, and that the proposed method yields superior results compared to competing methods from the literature.
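To make the idea of path-level aggregation concrete, the following is a minimal sketch and not the method as implemented in this work. It assumes that each class already has a per-instance probability from some base classifier, that the hierarchy is a tree, and that the aggregation is an arithmetic mean over the classes on each leaf-to-root path; the abstract does not specify the aggregation function, so the mean is purely an illustrative choice. The toy hierarchy, the names `PARENT`, `aggregate_paths` and `select_candidates`, and all numbers are hypothetical.

```python
from statistics import mean

# Hypothetical toy hierarchy, stored as child -> parent (the root has no parent).
PARENT = {"root": None, "A": "root", "B": "root",
          "A1": "A", "A2": "A", "B1": "B"}


def path_to_root(cls, parent):
    """Return the classes visited when walking from `cls` up to the root."""
    path = []
    while cls is not None:
        path.append(cls)
        cls = parent[cls]
    return path


def leaves(parent):
    """Return the classes that are never the parent of another class."""
    internal = {p for p in parent.values() if p is not None}
    return [c for c in parent if c not in internal]


def aggregate_paths(probs, parent, agg=mean):
    """Score each leaf-to-root path for one instance.

    `probs` maps class -> predicted probability for that instance.
    Classes without a probability (here, the root) are skipped.
    The mean is an assumption; any other aggregation can be passed as `agg`.
    """
    scores = {}
    for leaf in leaves(parent):
        on_path = [probs[c] for c in path_to_root(leaf, parent) if c in probs]
        scores[leaf] = agg(on_path)
    return scores


def select_candidates(probs, annotated, parent, k=2):
    """Rank leaf classes the instance is not yet annotated with, keep the top k."""
    scores = aggregate_paths(probs, parent)
    missing = {leaf: s for leaf, s in scores.items() if leaf not in annotated}
    return sorted(missing, key=missing.get, reverse=True)[:k]


# Hypothetical per-class probabilities for a single gene.
probs = {"A": 0.9, "A1": 0.8, "A2": 0.3, "B": 0.2, "B1": 0.6}
annotated = {"A", "A2"}  # annotations the gene already has

path_scores = aggregate_paths(probs, PARENT)
candidates = select_candidates(probs, annotated, PARENT, k=2)
```

In this toy setting, a leaf with a moderate probability under a confidently predicted parent can outrank a leaf with a similar probability under an unlikely parent, which is the kind of effect that using the hierarchy is meant to capture. The ranked candidates stand in for the annotations that would be passed on for experimental verification.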