AnswerSumm: A Manually-Curated Dataset and Pipeline for Answer Summarization

Community Question Answering (CQA) fora such as Stack Overflow and Yahoo! Answers contain a rich resource of answers to a wide range of community-based questions. Each question thread can receive a large number of answers with different perspectives. One goal of answer summarization is to produce a summary that reflects the range of answer perspectives. A major obstacle for abstractive answer summarization is the absence of a dataset to provide supervision for producing such summaries. Recent works propose heuristics to create such data, but the resulting data are often noisy and do not cover all perspectives present in the answers. This work introduces a novel dataset of 4,631 CQA threads for answer summarization, curated by professional linguists. Our pipeline gathers annotations for all subtasks involved in answer summarization: selecting the answer sentences relevant to the question, grouping these sentences by perspective, summarizing each perspective, and producing an overall summary. We analyze and benchmark state-of-the-art models on these subtasks and introduce a novel unsupervised approach for multi-perspective data augmentation that further boosts overall summarization performance according to automatic evaluation. Finally, we propose reinforcement learning rewards to improve factual consistency and answer coverage, and we analyze areas for improvement.
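The four annotation subtasks named in the abstract can be pictured as one nested record per thread. The Python sketch below is only an illustration: the class names, field names, and the rule that each relevant sentence belongs to exactly one perspective are assumptions made here, not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PerspectiveCluster:
    """One perspective: the sentences grouped under it and its summary."""
    sentence_ids: List[int]  # indices into AnnotatedThread.answer_sentences
    summary: str             # subtask 3: per-perspective summary


@dataclass
class AnnotatedThread:
    """Hypothetical container for the four subtasks; not the real schema."""
    question: str
    answer_sentences: List[str]
    relevant_ids: List[int]             # subtask 1: sentence selection
    clusters: List[PerspectiveCluster]  # subtask 2: grouping by perspective
    overall_summary: str                # subtask 4: overall summary

    def validate(self) -> bool:
        """Check the assumed rule: every relevant sentence is in exactly
        one cluster, and all indices point at existing sentences."""
        relevant = set(self.relevant_ids)
        in_range = all(0 <= i < len(self.answer_sentences) for i in relevant)
        clustered = [i for c in self.clusters for i in c.sentence_ids]
        no_overlap = len(clustered) == len(set(clustered))
        return in_range and no_overlap and set(clustered) == relevant


# Toy thread invented for illustration only.
example = AnnotatedThread(
    question="Should I learn Python or Java first?",
    answer_sentences=[
        "Python has a gentler syntax for beginners.",
        "I started programming in 2009.",
        "Java forces you to learn types early, which helps later.",
        "Python lets you build small projects quickly.",
    ],
    relevant_ids=[0, 2, 3],
    clusters=[
        PerspectiveCluster([0, 3], "Python is easier to start with."),
        PerspectiveCluster([2], "Java teaches typing discipline early."),
    ],
    overall_summary=(
        "Some recommend Python for its ease of use, "
        "while others favor Java for teaching types early."
    ),
)
```

Calling `example.validate()` checks that the toy annotations are internally consistent under the assumed one-cluster-per-sentence rule.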


