Generative spoken language modeling involves jointly learning the acoustic and linguistic characteristics of a language from raw audio alone (without text or labels). We introduce metrics to automatically evaluate the generated output for acoustic and linguistic quality in two associated end-to-end tasks, respectively: speech resynthesis (repeating the speech input in the system's own voice) and speech generation (producing novel speech outputs, either conditioned on a spoken prompt or unconditionally), and we validate these metrics against human judgment. We test baseline systems consisting of a discrete speech encoder (returning discrete, low-bitrate pseudo-text units), a generative language model (trained on the pseudo-text units), and a speech decoder (generating a waveform from pseudo-text). By comparing three state-of-the-art unsupervised speech encoders (Contrastive Predictive Coding (CPC), wav2vec 2.0, and HuBERT) and varying the number of discrete units (50, 100, 200), we investigate how generative performance depends on the quality of the learned units, as measured by unsupervised metrics (zero-shot probe tasks). We will open-source our evaluation stack and baseline models.
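To make the three-component architecture concrete, here is a minimal sketch in Python/PyTorch of the encoder, unit language model, and decoder pipeline and of the two evaluation tasks. All names (`encoder`, `unit_lm`, `decoder`, `sample`, and so on) are hypothetical placeholders assumed for illustration, not the API of the released models; the run-length collapsing of repeated units in `quantize` is a common choice for unit extraction and an assumption here, not something stated above.

```python
# Hypothetical sketch of the pipeline: speech encoder -> discrete
# pseudo-text units -> unit language model -> speech decoder.
# Only quantize() is concrete; encoder/unit_lm/decoder are placeholders.
import torch

def quantize(features: torch.Tensor, centroids: torch.Tensor) -> list[int]:
    """Map frame-level SSL features (T, D) to discrete pseudo-text units
    via nearest k-means centroid (K, D), e.g. K in {50, 100, 200}.
    Collapsing consecutive repeats is an assumed deduplication step."""
    units = torch.cdist(features, centroids).argmin(dim=-1).tolist()
    return [u for i, u in enumerate(units) if i == 0 or u != units[i - 1]]

def resynthesize(waveform, encoder, decoder, centroids):
    """Speech resynthesis: audio -> pseudo-text units -> audio,
    reproducing the input content in the system's own voice."""
    units = quantize(encoder(waveform), centroids)
    return decoder(units)

def generate(prompt, encoder, unit_lm, decoder, centroids, max_new=200):
    """Conditional speech generation: encode a spoken prompt, sample a
    unit continuation from the language model, vocode the full sequence."""
    units = quantize(encoder(prompt), centroids)
    continuation = unit_lm.sample(units, max_new_tokens=max_new)
    return decoder(units + continuation)

if __name__ == "__main__":
    # Toy stand-ins: a real system would plug in CPC / wav2vec 2.0 /
    # HuBERT features, a trained unit LM, and a unit-to-waveform decoder.
    feats = torch.randn(120, 256)           # 120 frames of 256-dim features
    centroids = torch.randn(100, 256)       # 100 learned k-means units
    print(quantize(feats, centroids)[:10])  # first few pseudo-text units
```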