Medical Subject Headings (MeSH) indexing is the task of assigning to a given biomedical document the most relevant labels from an extremely large set of MeSH terms. Currently, the vast number of biomedical articles in the PubMed database are annotated manually by human curators, which is time-consuming and costly; a computational system that can assist with indexing is therefore highly valuable. Developing supervised MeSH indexing systems calls for a large-scale annotated text corpus, and a publicly available corpus that permits robust evaluation and comparison of different systems is important to the research community. We release a large-scale annotated MeSH indexing corpus, MeSHup, which contains 1,342,667 full-text articles in English, together with the associated MeSH labels and metadata, authors, and publication venues, all collected from the MEDLINE database. We train an end-to-end model that combines features from documents and their associated labels on our corpus and report a new baseline.