We introduce MediaSum, a large-scale media interview dataset consisting of 463.6K transcripts with abstractive summaries. To create this dataset, we collect interview transcripts from NPR and CNN and use the overview and topic descriptions as summaries. Compared with existing public corpora for dialogue summarization, our dataset is an order of magnitude larger and contains complex multi-party conversations from multiple domains. We conduct statistical analysis to demonstrate the distinctive positional bias exhibited in transcripts of television and radio interviews. We also show that MediaSum can be used in transfer learning to improve a model's performance on other dialogue summarization tasks.