This study presents three de-identified large medical text datasets, named DISCHARGE, ECHO, and RADIOLOGY, which contain 50K, 16K, and 378K report-summary pairs derived from MIMIC-III, respectively. We implement strong baselines for automated abstractive summarization on the proposed datasets with pre-trained encoder-decoder language models, including BERT2BERT, T5-large, and BART. Further, building on the BART model, we leverage summaries sampled from the training set as prior knowledge guidance: the encoder encodes additional contextual representations of the guidance, which are then used to enhance the decoding representations in the decoder. The experimental results confirm that the proposed method improves ROUGE scores and BERTScore, outperforming the larger T5-large model.
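To make the guidance idea concrete, the following is a minimal sketch, not the authors' exact implementation, of how a summary sampled from the training set could serve as prior knowledge for BART-based summarization. It assumes the HuggingFace transformers library and a hypothetical report/guidance pair; the report and the guidance are encoded separately with BART's encoder, and their concatenated contextual representations are exposed to the decoder through standard cross-attention.

```python
# Hedged sketch: guidance-enhanced generation with BART (HuggingFace transformers).
# The report text and a hypothetical guidance summary are placeholders, not data
# from the DISCHARGE/ECHO/RADIOLOGY datasets.
import torch
from transformers import BartTokenizer, BartForConditionalGeneration
from transformers.modeling_outputs import BaseModelOutput

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")
model.eval()

report = "Chest radiograph shows no acute cardiopulmonary abnormality."  # hypothetical report
guidance = "No acute cardiopulmonary process."  # summary sampled from the training set (hypothetical)

src = tokenizer(report, return_tensors="pt", truncation=True, max_length=1024)
gui = tokenizer(guidance, return_tensors="pt", truncation=True, max_length=1024)

with torch.no_grad():
    # Encode the report and the guidance separately with the shared encoder.
    enc_src = model.get_encoder()(**src).last_hidden_state
    enc_gui = model.get_encoder()(**gui).last_hidden_state

    # Concatenate both sets of contextual representations so the decoder can
    # attend to the report and the prior-knowledge guidance at the same time.
    enc_cat = torch.cat([enc_src, enc_gui], dim=1)
    mask_cat = torch.cat([src["attention_mask"], gui["attention_mask"]], dim=1)

    summary_ids = model.generate(
        encoder_outputs=BaseModelOutput(last_hidden_state=enc_cat),
        attention_mask=mask_cat,
        max_length=64,
        num_beams=4,
    )

print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```

In the paper's setting the guidance representations enhance the decoder's representations; this sketch approximates that by simply widening the cross-attention context, which is one common way to inject auxiliary encoder states without modifying the model architecture.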