This paper describes NEREL-BIO, an annotation scheme and corpus of PubMed abstracts in Russian, with a smaller number of abstracts in English. NEREL-BIO extends the general-domain dataset NEREL by introducing domain-specific entity types. The NEREL-BIO annotation scheme covers both the general and biomedical domains, making it suitable for domain transfer experiments. NEREL-BIO provides annotation for nested named entities as an extension of the scheme employed for NEREL. Nested named entities may cross entity boundaries to connect to shorter entities nested within longer entities, making them harder to detect. NEREL-BIO contains annotations for 700+ Russian and 100+ English abstracts. Every annotated English abstract has a corresponding Russian counterpart. Thus, NEREL-BIO has two distinguishing features: it annotates nested named entities, and it can serve as a benchmark for cross-domain (NEREL -> NEREL-BIO) and cross-language (English -> Russian) transfer. We experiment with both transformer-based sequence models and machine reading comprehension (MRC) models and report their results. The dataset is freely available at https://github.com/nerel-ds/NEREL-BIO.
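To make the notion of nested named entities concrete, the following minimal sketch represents entity mentions as character spans and lists the pairs in which one mention lies strictly inside another. The `Span` type, the `nested_pairs` helper, the example phrase, and the labels `DISO` and `ANATOMY` are illustrative assumptions and are not taken from the NEREL-BIO scheme or its tooling.

```python
from typing import List, NamedTuple, Tuple


class Span(NamedTuple):
    """A hypothetical entity mention given by character offsets."""
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str  # illustrative entity type, not the official tag set


def nested_pairs(spans: List[Span]) -> List[Tuple[Span, Span]]:
    """Return (outer, inner) pairs where inner lies strictly inside outer."""
    pairs = []
    for outer in spans:
        for inner in spans:
            contained = outer.start <= inner.start and inner.end <= outer.end
            same_extent = (outer.start, outer.end) == (inner.start, inner.end)
            if contained and not same_extent:
                pairs.append((outer, inner))
    return pairs


# Illustrative phrase: a longer mention with a shorter mention nested in it.
text = "chronic kidney disease"
spans = [
    Span(0, 22, "DISO"),     # the whole phrase (assumed label)
    Span(8, 14, "ANATOMY"),  # "kidney" (assumed label)
]

pairs = nested_pairs(spans)
for outer, inner in pairs:
    print(f"{text[inner.start:inner.end]!r} ({inner.label}) is nested in "
          f"{text[outer.start:outer.end]!r} ({outer.label})")
```

A flat sequence tagger assigns a single label per token and so cannot emit both mentions at once, which is one reason nested annotation is usually evaluated with span-based or MRC-style models.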