We introduce dGSLM, the first "textless" model able to generate audio samples of naturalistic spoken dialogues. It builds on recent work on unsupervised spoken unit discovery, coupled with a dual-tower transformer architecture with cross-attention, trained on 2000 hours of two-channel raw conversational audio (Fisher dataset) without any text or labels. We show that our model is able to generate speech, laughter, and other paralinguistic signals in the two channels simultaneously, and reproduces more naturalistic and fluid turn-taking than a text-based cascaded model.
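The dual-tower architecture with cross-attention mentioned above can be sketched as follows. This is a minimal illustration, not the paper's implementation: each tower models one speaker channel's discrete unit sequence, a shared layer applies self-attention within a channel and cross-attention to the other channel, and all hyperparameters (`n_units`, `d_model`, layer counts) are hypothetical. Causal masking, duration/delay prediction, and the unit-to-waveform vocoder are omitted.

```python
import torch
import torch.nn as nn


class DualTowerLayer(nn.Module):
    """One block: self-attention within a channel, then cross-attention
    over the other channel's states, then a feed-forward sublayer.
    A sketch only; the real model also uses causal masking."""

    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, x, other):
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]          # attend within own channel
        o = self.norm2(other)
        x = x + self.cross_attn(self.norm2(x), o, o)[0]  # attend to the other channel
        return x + self.ff(self.norm3(x))


class DualTowerLM(nn.Module):
    """Two weight-shared towers, one per speaker channel, each predicting
    the next discrete unit for its channel (hypothetical sizes)."""

    def __init__(self, n_units=500, d_model=64, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(n_units, d_model)
        self.layers = nn.ModuleList(DualTowerLayer(d_model) for _ in range(n_layers))
        self.head = nn.Linear(d_model, n_units)

    def forward(self, units_a, units_b):
        a, b = self.embed(units_a), self.embed(units_b)
        for layer in self.layers:
            # Symmetric update: both towers read the other's *previous* states.
            a, b = layer(a, b), layer(b, a)
        return self.head(a), self.head(b)


model = DualTowerLM()
units_a = torch.randint(0, 500, (1, 10))  # toy unit sequence, channel A
units_b = torch.randint(0, 500, (1, 10))  # toy unit sequence, channel B
logits_a, logits_b = model(units_a, units_b)
```

Because both towers share weights and exchange states at every layer, each channel's next-unit prediction is conditioned on both speakers, which is what lets the model time backchannels, laughter, and turn-taking across the two channels jointly.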