Speech pre-training has primarily demonstrated efficacy on classification tasks, while its capability to generate novel speech, similar to how GPT-2 can generate coherent paragraphs, has barely been explored. Generative Spoken Language Modeling (GSLM) \cite{Lakhotia2021} is the only prior work addressing the generative aspects of speech pre-training, which replaces text with discovered phone-like units for language modeling and shows the ability to generate meaningful novel sentences. Unfortunately, despite eliminating the need for text, the units used in GSLM discard most of the prosodic information. Hence, GSLM fails to leverage prosody for better comprehension, and does not generate expressive speech. In this work, we present a prosody-aware generative spoken language model (pGSLM). It is composed of a multi-stream transformer language model (MS-TLM) of speech, represented as discovered unit and prosodic feature streams, and an adapted HiFi-GAN model converting MS-TLM outputs to waveforms. We devise a series of metrics for prosody modeling and generation, and re-use metrics from GSLM for content modeling. Experimental results show that pGSLM can utilize prosody to improve both prosody and content modeling, and also generate natural, meaningful, and coherent speech given a spoken prompt. Audio samples can be found at https://speechbot.github.io/pgslm. Code and models are available at https://github.com/pytorch/fairseq/tree/main/examples/textless_nlp/pgslm.
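To make the MS-TLM idea concrete, the sketch below shows one way a multi-stream transformer language model over discovered units and prosodic features could be organized. This is a minimal illustration, not the released fairseq implementation: the class name `MultiStreamTLM`, the choice of duration and F0 as the two prosodic streams, the regression heads, and all hyper-parameters are assumptions made for readability.

```python
import torch
import torch.nn as nn


class MultiStreamTLM(nn.Module):
    """Illustrative multi-stream transformer LM (MS-TLM) sketch.

    Each step consumes a discovered unit plus prosodic features
    (duration and F0 assumed here) and predicts all three streams
    for the next step. Sizes and layers are placeholders.
    """

    def __init__(self, num_units=100, d_model=512, n_layers=6, n_heads=8):
        super().__init__()
        self.unit_emb = nn.Embedding(num_units, d_model)
        self.dur_proj = nn.Linear(1, d_model)   # duration (frames) -> embedding
        self.f0_proj = nn.Linear(1, d_model)    # F0 value -> embedding
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        # One output head per stream.
        self.unit_head = nn.Linear(d_model, num_units)  # categorical over units
        self.dur_head = nn.Linear(d_model, 1)           # duration regression
        self.f0_head = nn.Linear(d_model, 1)            # F0 regression

    def forward(self, units, durations, f0):
        # units: (B, T) int64; durations, f0: (B, T, 1) float
        x = self.unit_emb(units) + self.dur_proj(durations) + self.f0_proj(f0)
        t = units.size(1)
        # Causal mask so each position only attends to the past.
        causal_mask = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        h = self.backbone(x, mask=causal_mask)
        return self.unit_head(h), self.dur_head(h), self.f0_head(h)


if __name__ == "__main__":
    model = MultiStreamTLM()
    units = torch.randint(0, 100, (2, 16))
    dur = torch.rand(2, 16, 1)
    f0 = torch.rand(2, 16, 1)
    unit_logits, dur_pred, f0_pred = model(units, dur, f0)
    print(unit_logits.shape, dur_pred.shape, f0_pred.shape)
```

In this reading, the per-step outputs (sampled units plus predicted prosodic values) would be handed to a vocoder such as the adapted HiFi-GAN mentioned above to synthesize the waveform; the exact feature parameterization and sampling procedure are described in the paper and the linked repository.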