The rapid growth of biomedical literature accumulates diverse yet comprehensive knowledge waiting to be mined, such as drug interactions. However, it is difficult to extract this heterogeneous knowledge, or to retrieve and even discover the latest and novel findings, in an efficient manner. To address this problem, we propose EGFI for extracting and consolidating drug interactions from large-scale medical literature text data. Specifically, EGFI consists of two parts: classification and generation. In the classification part, EGFI incorporates the language model BioBERT, which has been comprehensively pre-trained on biomedical corpora. In particular, we propose a multi-head attention mechanism combined with a packed BiGRU to fuse multiple kinds of semantic information for rigorous context modeling. In the generation part, EGFI utilizes another pre-trained language model, BioGPT-2, whose generated sentences are selected according to filtering rules. We evaluated the classification part on the "DDIs 2013" dataset and the "DTIs" dataset, achieving F1 scores of 0.842 and 0.720, respectively. Moreover, we applied the classification part to distinguish high-quality generated sentences and verified them against the existing ground truth to confirm the filtered sentences. The generated sentences that are not recorded in DrugBank or the DDIs 2013 dataset further demonstrate the potential of EGFI to identify novel drug relationships.
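The selection of generated sentences by filtering rules can be sketched as follows. This is a minimal illustration only: the rule names, thresholds, and the combination with a classifier confidence score are assumptions for exposition, not the exact rules used by EGFI.

```python
# Illustrative sketch of rule-based filtering of generated sentences.
# Rule names and thresholds are hypothetical, not EGFI's actual rules.

def contains_both_drugs(sentence, drug_a, drug_b):
    """Surface rule: the sentence must mention both drug entities."""
    s = sentence.lower()
    return drug_a.lower() in s and drug_b.lower() in s

def long_enough(sentence, min_tokens=6):
    """Surface rule: discard fragments shorter than a token threshold."""
    return len(sentence.split()) >= min_tokens

def passes_filters(sentence, drug_a, drug_b, classifier_score, threshold=0.9):
    """Keep a generated sentence only if it satisfies the surface rules
    and the classification part assigns it a high interaction score."""
    return (contains_both_drugs(sentence, drug_a, drug_b)
            and long_enough(sentence)
            and classifier_score >= threshold)

# Example: a candidate sentence generated for the pair (aspirin, warfarin)
cand = "Aspirin may increase the anticoagulant effect of warfarin."
print(passes_filters(cand, "aspirin", "warfarin", classifier_score=0.95))  # True
```

In this sketch, the classification part plays the role of the final gate: only sentences that both look well-formed and are scored as likely interactions survive, which mirrors how EGFI uses its classifier to distinguish high-quality generated sentences.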