The Gene Ontology (GO) is the primary gene-function knowledge base that enables computational tasks in biomedicine. The basic element of GO is a term, which includes a set of genes with the same function. Existing research on GO mainly focuses on predicting gene-term associations. Other tasks, such as generating descriptions of new terms, are rarely pursued. In this paper, we propose a novel task: GO term description generation. This task aims to automatically generate a sentence that describes the function of a GO term belonging to one of three categories: molecular function, biological process, or cellular component. To address this task, we propose a Graph-in-Graph network that can efficiently leverage the structural information of GO. The proposed network introduces a two-layer graph: the first layer is a graph of GO terms, in which each node is itself a graph of genes (the gene graph). Such a Graph-in-Graph network can derive the biological functions of GO terms and generate proper descriptions. To validate the effectiveness of the proposed network, we build three large-scale benchmark datasets. Incorporating the proposed Graph-in-Graph network substantially boosts the performance of seven different sequence-to-sequence models across all evaluation metrics, with relative improvements of up to 34.7%, 14.5%, and 39.1% in BLEU, ROUGE-L, and METEOR, respectively.
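To make the two-layer structure concrete, the following is a minimal sketch of a graph whose nodes are themselves graphs. It illustrates only the nested data structure, not the proposed network or its learned components. The class names `GeneGraph` and `TermGraph`, the helper `shared_genes`, the term identifiers, the gene names, and the gene-gene edges are all hypothetical and chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class GeneGraph:
    """Inner graph: the genes attached to one GO term (edges are hypothetical)."""
    genes: list[str]
    edges: list[tuple[str, str]] = field(default_factory=list)

    def degree(self, gene: str) -> int:
        # Number of inner edges that touch this gene.
        return sum(gene in edge for edge in self.edges)


@dataclass
class TermGraph:
    """Outer graph: GO terms as nodes, each node carrying its own GeneGraph."""
    nodes: dict[str, GeneGraph] = field(default_factory=dict)
    edges: list[tuple[str, str]] = field(default_factory=list)  # (child, parent)

    def add_term(self, term_id: str, gene_graph: GeneGraph) -> None:
        self.nodes[term_id] = gene_graph

    def add_edge(self, child: str, parent: str) -> None:
        # Both endpoints must already exist as term nodes.
        if child not in self.nodes or parent not in self.nodes:
            raise KeyError("both terms must be added before linking them")
        self.edges.append((child, parent))

    def neighbors(self, term_id: str) -> list[str]:
        # Terms directly linked to term_id, in either direction.
        out = []
        for child, parent in self.edges:
            if child == term_id:
                out.append(parent)
            elif parent == term_id:
                out.append(child)
        return out


def shared_genes(graph: TermGraph, a: str, b: str) -> set[str]:
    """Genes that appear in the inner graphs of both terms."""
    return set(graph.nodes[a].genes) & set(graph.nodes[b].genes)


# Toy example with made-up term identifiers and gene names.
toy = TermGraph()
toy.add_term("term_child", GeneGraph(["g1", "g2"], [("g1", "g2")]))
toy.add_term("term_parent", GeneGraph(["g1", "g2", "g3"], [("g1", "g2"), ("g2", "g3")]))
toy.add_edge("term_child", "term_parent")
```

In this toy layout, a model would have access to both levels of structure: the links between terms in the outer graph and the gene-level links inside each term node.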