Although each rare disease has a low prevalence, approximately 300 million people worldwide are affected by one. Early and accurate diagnosis of these conditions is a major challenge for general practitioners, who often lack the specialized knowledge needed to identify them. In addition, rare diseases usually present a wide variety of manifestations, which can make diagnosis even more difficult. A delayed diagnosis can negatively affect the patient's life. There is therefore an urgent need to increase scientific and medical knowledge about rare diseases. Natural Language Processing (NLP) and deep learning can help extract relevant information about rare diseases to facilitate their diagnosis and treatment. This paper explores the use of several deep learning techniques, such as Bidirectional Long Short-Term Memory (BiLSTM) networks and deep contextualized word representations based on Bidirectional Encoder Representations from Transformers (BERT), to recognize rare diseases and their clinical manifestations (signs and symptoms) in the RareDis corpus. This corpus contains more than 5,000 rare diseases and almost 6,000 clinical manifestations. BioBERT, a domain-specific language representation model based on BERT and trained on biomedical corpora, obtains the best results. In particular, this model achieves an F1-score of 85.2% for rare diseases, outperforming all the other models.
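To make the reported metric concrete, the following is a minimal sketch of how an entity-level F1-score is commonly computed for a recognition task like this one. It assumes BIO-style tags and exact-match scoring of entity spans; the evaluation protocol actually used for the 85.2% figure is not described here, and the label names `RAREDISEASE` and `SIGN`, as well as the helper names `extract_spans` and `entity_f1`, are illustrative assumptions rather than part of the original work.

```python
def extract_spans(tags):
    """Collect (label, start, end) entity spans from a BIO tag sequence.

    `end` is exclusive. An I- tag that does not continue the current
    entity is treated leniently as the start of a new one.
    """
    spans = set()
    start, label = None, None
    # A trailing "O" sentinel closes any entity still open at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if not continues:
            if label is not None:
                spans.add((label, start, i))
            if tag.startswith("B-") or tag.startswith("I-"):
                start, label = i, tag[2:]
            else:
                start, label = None, None
    return spans


def entity_f1(gold_tags, pred_tags, label):
    """Exact-match F1 for one entity type (hypothetical helper)."""
    gold = {s for s in extract_spans(gold_tags) if s[0] == label}
    pred = {s for s in extract_spans(pred_tags) if s[0] == label}
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example with made-up tags: the second rare-disease mention is missed.
gold = ["B-RAREDISEASE", "I-RAREDISEASE", "O", "B-SIGN", "O", "B-RAREDISEASE"]
pred = ["B-RAREDISEASE", "I-RAREDISEASE", "O", "B-SIGN", "O", "O"]
print(round(entity_f1(gold, pred, "RAREDISEASE"), 3))
print(entity_f1(gold, pred, "SIGN"))
```

In this toy case the model finds one of the two rare-disease mentions with no false positives, so precision is 1.0, recall is 0.5, and F1 is their harmonic mean. Exact-match scoring is a strict choice: a prediction that overlaps a gold mention but has different boundaries counts as both a false positive and a false negative.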