Recent audio language models can follow long conversations, yet research on emotion-aware or spoken dialogue summarization is constrained by the lack of data linking speech, summaries, and paralinguistic cues. We introduce Spoken DialogSum, the first corpus that aligns raw conversational audio with factual summaries, emotion-rich summaries, and utterance-level labels for speaker age, gender, and emotion. The dataset is built in two stages: first, an LLM rewrites DialogSum scripts with Switchboard-style fillers and back-channels, then tags each utterance with emotion, pitch, and speaking rate; second, an expressive TTS engine synthesizes speech from the tagged scripts, yielding audio aligned with the paralinguistic labels. Spoken DialogSum comprises 13,460 emotion-diverse dialogues, each paired with both a factual and an emotion-focused summary. We release an online demo at https://fatfat-emosum.github.io/EmoDialog-Sum-Audio-Samples/ and plan to release the full dataset in the near future. Baselines show that an Audio-LLM improves ROUGE-L on emotion-focused summaries by a relative 28% over a cascaded ASR-LLM system, confirming the value of end-to-end speech modeling.
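The two-stage construction can be pictured as a small pipeline: an LLM rewriting-and-tagging pass followed by expressive TTS synthesis. The sketch below is only illustrative; the function names, tag schema, and data classes are hypothetical stand-ins, not the actual tooling used to build the corpus.

```python
# Minimal sketch of the two-stage pipeline described above (assumed structure).
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TaggedUtterance:
    speaker: str
    text: str           # DialogSum line rewritten with fillers / back-channels
    emotion: str        # utterance-level emotion label, e.g. "happy"
    pitch: str          # e.g. "high" / "low"
    speaking_rate: str  # e.g. "fast" / "slow"

def rewrite_with_fillers(script: List[str]) -> List[TaggedUtterance]:
    """Stage 1 (hypothetical): an LLM adds Switchboard-style fillers and
    back-channels, then tags each utterance with emotion, pitch, and rate."""
    raise NotImplementedError("LLM rewriting/tagging step")

def synthesize_expressive(utterances: List[TaggedUtterance]) -> List[bytes]:
    """Stage 2 (hypothetical): an expressive TTS engine renders each tagged
    utterance so the audio stays aligned with its paralinguistic labels."""
    raise NotImplementedError("TTS synthesis step")

def build_dialogue(script: List[str]) -> List[Tuple[TaggedUtterance, bytes]]:
    tagged = rewrite_with_fillers(script)   # text + utterance-level labels
    audio = synthesize_expressive(tagged)   # one waveform per utterance
    return list(zip(tagged, audio))
```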