A Dataset and Benchmark for Consumer Healthcare Question Summarization

The search for health information has flooded the web with consumers' health-related questions. Consumers generally use overly descriptive and peripheral information to express their medical conditions or other healthcare needs, which adds to the challenges of natural language understanding. One way to address this challenge is to summarize the questions and distill the key information of the original question. Recently, large-scale datasets have significantly propelled the development of several summarization tasks, such as multi-document summarization and dialogue summarization. However, the lack of a domain-expert annotated dataset for the consumer healthcare question summarization task inhibits the development of an efficient summarization system. To address this issue, we introduce a new dataset, CHQ-Sum, that contains 1507 domain-expert annotated consumer health questions and corresponding summaries. The dataset is derived from a community question answering forum and therefore provides a valuable resource for understanding consumer health-related posts on social media. We benchmark the dataset on multiple state-of-the-art summarization models to show the effectiveness of the dataset.

