Summarizing long consumer health questions into short, essential ones helps medical question answering (MQA) systems understand consumer intent more accurately and retrieve suitable answers. However, medical question summarization is challenging because patients and doctors describe health problems very differently. Although existing work has applied Seq2Seq models, reinforcement learning, or contrastive learning to the problem, two challenges remain: how to correctly capture the question focus in order to model its semantic intent, and how to obtain reliable datasets for fair performance evaluation. To address these challenges, this paper proposes a novel medical question summarization framework based on entity-driven contrastive learning (ECL). ECL treats the medical entities in frequently asked questions (FAQs) as focuses and devises an effective mechanism for generating hard negative samples. This forces models to attend to the crucial focus information and generate better question summaries. Additionally, we find that some MQA datasets suffer from serious data leakage; for example, the iCliniq dataset has a 33% duplicate rate. To evaluate the related methods fairly, this paper carefully checks for leaked samples and reorganizes them into more reasonable datasets. Extensive experiments demonstrate that our ECL method outperforms state-of-the-art methods by accurately capturing the question focus and generating medical question summaries. The code and datasets are available at https://github.com/yrbobo/MQS-ECL.
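The data leakage check mentioned above can be illustrated with a minimal sketch. The abstract does not describe the exact procedure used, so everything here is an assumption for illustration: the function names (`normalize`, `leaked_fraction`, `drop_leaked`), the normalization rule, and the choice to count a test question as leaked when its normalized text also appears in the training set.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace.

    Assumption: near-identical questions that differ only in case,
    punctuation, or spacing count as duplicates.
    """
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def leaked_fraction(train_questions, test_questions):
    """Fraction of test questions whose normalized text also occurs in train."""
    if not test_questions:
        return 0.0
    seen = {normalize(q) for q in train_questions}
    leaked = sum(1 for q in test_questions if normalize(q) in seen)
    return leaked / len(test_questions)


def drop_leaked(train_questions, test_questions):
    """Return the test questions that do not duplicate any training question."""
    seen = {normalize(q) for q in train_questions}
    return [q for q in test_questions if normalize(q) not in seen]


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    train = ["What causes migraine?", "Is ibuprofen safe?"]
    test = ["what causes  migraine", "How to treat acne?", "Is ibuprofen safe?"]
    print(leaked_fraction(train, test))
    print(drop_leaked(train, test))
```

Exact matching after normalization is the simplest possible criterion; a real audit might also need fuzzy matching to catch paraphrased duplicates, which this sketch does not attempt.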