Recent audio language models can follow long conversations, yet research on emotion-aware or spoken dialogue summarization is constrained by the lack of data linking speech, summaries, and paralinguistic cues. We introduce Spoken DialogSum, the first corpus that aligns raw conversational audio with factual summaries, emotion-rich summaries, and utterance-level labels for speaker age, gender, and emotion. The dataset is built in two stages: first, an LLM rewrites DialogSum scripts with Switchboard-style fillers and back-channels, then tags each utterance with emotion, pitch, and speaking rate; second, an expressive TTS engine synthesizes speech from the tagged scripts, yielding audio aligned with the paralinguistic labels. Spoken DialogSum comprises 13,460 emotion-diverse dialogues, each paired with both a factual and an emotion-focused summary. We release an online demo at https://fatfat-emosum.github.io/EmoDialog-Sum-Audio-Samples/ and plan to release the full dataset in the near future. Baselines show that an Audio-LLM improves ROUGE-L on emotion-focused summaries by a relative 28% over a cascaded ASR-LLM system, confirming the value of end-to-end speech modeling.
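The two-stage construction can be pictured as a small pipeline: an LLM rewriting-and-tagging pass followed by expressive TTS synthesis. The sketch below is only illustrative; the function names, tag schema, and data classes are hypothetical stand-ins, not the actual tooling used to build the corpus.

```python
# Minimal sketch of the two-stage pipeline described above (assumed structure).
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TaggedUtterance:
    speaker: str
    text: str           # DialogSum line rewritten with fillers / back-channels
    emotion: str        # utterance-level emotion label, e.g. "happy"
    pitch: str          # e.g. "high" / "low"
    speaking_rate: str  # e.g. "fast" / "slow"

def rewrite_with_fillers(script: List[str]) -> List[TaggedUtterance]:
    """Stage 1 (hypothetical): an LLM adds Switchboard-style fillers and
    back-channels, then tags each utterance with emotion, pitch, and rate."""
    raise NotImplementedError("LLM rewriting/tagging step")

def synthesize_expressive(utterances: List[TaggedUtterance]) -> List[bytes]:
    """Stage 2 (hypothetical): an expressive TTS engine renders each tagged
    utterance so the audio stays aligned with its paralinguistic labels."""
    raise NotImplementedError("TTS synthesis step")

def build_dialogue(script: List[str]) -> List[Tuple[TaggedUtterance, bytes]]:
    tagged = rewrite_with_fillers(script)   # text + utterance-level labels
    audio = synthesize_expressive(tagged)   # one waveform per utterance
    return list(zip(tagged, audio))
```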