The rapid growth of biomedical literature poses a significant challenge for curation and interpretation. This has become more evident during the COVID-19 pandemic. LitCovid, a literature database of COVID-19-related papers in PubMed, has accumulated over 180,000 articles with millions of accesses. Approximately 10,000 new articles are added to LitCovid every month. A main curation task in LitCovid is topic annotation, where an article is assigned up to eight topics, e.g., Treatment and Diagnosis. The annotated topics have been widely used both in LitCovid (e.g., accounting for ~18% of total uses) and in downstream studies such as network generation. However, topic annotation has been a primary curation bottleneck due to the nature of the task and the rapid literature growth. This study proposes LITMC-BERT, a transformer-based multi-label classification method for biomedical literature. It uses a shared transformer backbone for all the labels while also capturing label-specific features and the correlations between label pairs. We compare LITMC-BERT with three baseline models on two datasets. Its micro-F1 and instance-based F1 are 5% and 4% higher than the current best results, respectively, and it requires only ~18% of the inference time of the Binary BERT baseline. The related datasets and models are available via https://github.com/ncbi/ml-transformer.
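To make the architectural idea concrete, the sketch below shows one way a shared transformer backbone can feed per-label heads, so that label-specific features are learned on top of a common encoding. This is a minimal illustration of the concept, not the LITMC-BERT implementation itself: the backbone name, layer sizes, and head structure are assumptions, and the pairwise label-correlation component described in the abstract is omitted for brevity.

```python
# Illustrative sketch only; the exact LITMC-BERT architecture is defined
# in the paper and the repository linked above.
import torch
import torch.nn as nn
from transformers import AutoModel


class SharedBackboneMultiLabel(nn.Module):
    """Shared transformer encoder with one small head per label."""

    def __init__(self, backbone_name: str = "bert-base-uncased", num_labels: int = 8):
        super().__init__()
        # One backbone is shared across all labels (the key efficiency idea).
        self.backbone = AutoModel.from_pretrained(backbone_name)
        hidden = self.backbone.config.hidden_size
        # A per-label projection captures label-specific features
        # on top of the shared [CLS] representation.
        self.label_proj = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(num_labels))
        self.classifiers = nn.ModuleList(nn.Linear(hidden, 1) for _ in range(num_labels))

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # Encode once, reuse for every label.
        cls = self.backbone(input_ids=input_ids,
                            attention_mask=attention_mask).last_hidden_state[:, 0]
        # Independent sigmoid logits, one per topic label.
        logits = [clf(torch.tanh(proj(cls)))
                  for proj, clf in zip(self.label_proj, self.classifiers)]
        return torch.cat(logits, dim=-1)  # shape: (batch, num_labels)
```

Because the backbone runs once per article rather than once per label, this design explains how a shared-backbone model can be far cheaper at inference time than a Binary BERT baseline, which fine-tunes and runs a separate model for each label.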
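The two reported evaluation metrics follow standard multi-label definitions: micro-F1 pools true/false positives and negatives over all label decisions, while instance-based F1 computes an F1 score per article and averages across articles. A minimal sketch using scikit-learn's implementations of these standard averages (the toy arrays are made up for illustration):

```python
# Toy example of the two metrics reported in the abstract.
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([[1, 0, 1], [0, 1, 0]])  # gold topic labels (toy data)
y_pred = np.array([[1, 0, 0], [0, 1, 1]])  # predicted labels (toy data)

micro_f1 = f1_score(y_true, y_pred, average="micro")      # pooled over all labels
instance_f1 = f1_score(y_true, y_pred, average="samples")  # averaged per article
print(f"micro-F1: {micro_f1:.3f}, instance-based F1: {instance_f1:.3f}")
```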