This paper describes an open Russian medical language understanding benchmark covering several task types (classification, question answering, natural language inference, and named entity recognition) over a number of novel text sets. Given the sensitive nature of healthcare data, such a benchmark partially addresses the shortage of publicly available Russian medical datasets. For the new tasks, we prepare unified-format labeling, data splits, and evaluation metrics; the remaining tasks are drawn from existing datasets with a few modifications. A single-number metric expresses a model's overall ability to cope with the benchmark. Moreover, we implement several baseline models, ranging from simple ones to neural networks with the transformer architecture, and release the code. As expected, the more advanced models yield better performance, although even a simple model achieves decent results on some tasks. Furthermore, we provide a human evaluation for all tasks. Interestingly, the models outperform humans on the large-scale classification tasks; however, natural intelligence retains the advantage on tasks requiring more knowledge and reasoning.
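The abstract mentions a single-number metric summarizing performance across the benchmark's tasks. One common way to obtain such a score is an unweighted macro-average of the per-task metrics; the sketch below illustrates that idea only, with hypothetical task names and scores, and does not reproduce the paper's actual aggregation scheme.

```python
# Hypothetical per-task scores (accuracy/F1-style values in [0, 1]);
# the task names and numbers are illustrative, not from the paper.
task_scores = {
    "classification": 0.91,
    "question_answering": 0.67,
    "nli": 0.73,
    "ner": 0.84,
}

def overall_score(scores: dict) -> float:
    """Unweighted macro-average across tasks: one number for the whole benchmark."""
    return sum(scores.values()) / len(scores)

print(overall_score(task_scores))  # 0.7875
```

A macro-average treats every task equally, so a weak result on a small task lowers the overall score as much as one on a large task; a weighted scheme is an alternative when tasks differ greatly in size.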