Recent work in spoken language modeling has shown that a language model can be learned from raw audio without any text labels. The approach relies on first transforming the audio into a sequence of discrete units (or pseudo-text) and then training a language model directly on that pseudo-text. Is such a discrete bottleneck necessary, given that it can introduce irreversible errors in the encoding of the speech signal, or could we learn a language model without discrete units at all? In this work, we study the role of discrete versus continuous representations in spoken language modeling. We show that discretization is indeed essential for good results in spoken language modeling: it removes linguistically irrelevant information from the continuous features, which improves language modeling performance. On the basis of this study, we train a language model on discrete units derived from HuBERT features, reaching new state-of-the-art results on the lexical, syntactic, and semantic metrics of the Zero Resource Speech Challenge 2021 (Track 1 - Speech Only).