Fine-tuning pretrained models to automatically summarize doctor-patient conversation transcripts presents many challenges: limited training data, significant domain shift, long and noisy transcripts, and high variability in the target summaries. In this paper, we explore the feasibility of using pretrained transformer models to summarize doctor-patient conversations directly from transcripts. We show that fluent and adequate summaries can be generated with limited training data by fine-tuning BART on a specially constructed dataset. The resulting models greatly surpass the performance of an average human annotator and the quality of previously published work on the task. We evaluate multiple methods for handling long conversations, comparing them against the obvious baseline of truncating the conversation to fit the pretrained model's length limit. We introduce a multistage approach that tackles the task with two fine-tuned models: one that summarizes conversation chunks into partial summaries, followed by one that rewrites the collection of partial summaries into a complete summary. With a carefully chosen fine-tuning dataset, this method proves effective at handling longer conversations and improves the quality of the generated summaries. We conduct both an automatic evaluation (using ROUGE and two concept-based metrics focusing on medical findings) and a human evaluation (using qualitative examples from the literature to assess hallucination, generalization, fluency, and overall quality of the generated summaries).
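The multistage approach described above can be sketched as a simple pipeline: split the transcript into chunks that fit the model's length limit, summarize each chunk into a partial summary, then rewrite the concatenated partial summaries into a final summary. The sketch below is illustrative only; `summarize_chunk` and `rewrite` are hypothetical callables standing in for the two fine-tuned BART models, and a crude whitespace token count stands in for the real subword tokenizer.

```python
from typing import Callable, List


def chunk_transcript(turns: List[str], max_tokens: int) -> List[List[str]]:
    """Greedily group conversation turns into chunks that fit a length budget.

    Uses a whitespace word count as a stand-in for the model tokenizer.
    """
    chunks: List[List[str]] = []
    current: List[str] = []
    size = 0
    for turn in turns:
        n = len(turn.split())  # crude token count; real code would use the BART tokenizer
        if current and size + n > max_tokens:
            chunks.append(current)
            current, size = [], 0
        current.append(turn)
        size += n
    if current:
        chunks.append(current)
    return chunks


def multistage_summarize(
    turns: List[str],
    summarize_chunk: Callable[[str], str],  # first fine-tuned model: chunk -> partial summary
    rewrite: Callable[[str], str],          # second fine-tuned model: partials -> final summary
    max_tokens: int = 512,
) -> str:
    """Two-stage summarization: summarize chunks, then rewrite the partial summaries."""
    chunks = chunk_transcript(turns, max_tokens)
    partials = [summarize_chunk(" ".join(chunk)) for chunk in chunks]
    return rewrite(" ".join(partials))
```

In practice each callable would wrap a generation call to its fine-tuned model; the value of the second stage is that it sees all partial summaries at once and can merge, deduplicate, and reorder them into one coherent summary.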