We introduce a new unsupervised task, spoken language modeling: learning linguistic representations from raw audio signals without any labels. Alongside it, we present the Zero Resource Speech Benchmark 2021: a suite of four black-box, zero-shot metrics probing the quality of the learned models at four linguistic levels: phonetics, lexicon, syntax, and semantics. We report results and analyses for a composite baseline built from the concatenation of three unsupervised systems: self-supervised contrastive representation learning (CPC), clustering (k-means), and language modeling (LSTM or BERT). The language models are trained on pseudo-text derived by clustering the learned representations. This simple pipeline performs better than chance on all four metrics, demonstrating the feasibility of spoken language modeling from raw speech. It nevertheless performs worse than text-based 'topline' systems trained on the same data, delineating the space to be explored by more sophisticated end-to-end models.
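The three-stage baseline above (continuous representations → discrete units → language model) can be sketched as follows. This is a minimal illustration, not the paper's implementation: random vectors stand in for CPC frame representations, and a smoothed count-based bigram model stands in for the LSTM/BERT language models; the cluster count and dimensions are arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans
from collections import Counter

# Stage 1 (stand-in): random vectors in place of CPC frame representations.
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 32))  # 1000 frames, 32-dim embeddings

# Stage 2: cluster the frames into discrete pseudo-phone units with k-means.
N_UNITS = 50
kmeans = KMeans(n_clusters=N_UNITS, n_init=10, random_state=0).fit(features)
pseudo_text = kmeans.labels_  # sequence of unit IDs: the "pseudo-text"

# Stage 3 (stand-in): a count-based bigram LM over the unit sequence,
# in place of the LSTM/BERT language models used in the paper.
bigrams = Counter(zip(pseudo_text[:-1], pseudo_text[1:]))
contexts = Counter(pseudo_text[:-1])

def bigram_prob(u, v, vocab=N_UNITS, alpha=1.0):
    """Add-alpha smoothed P(v | u) over pseudo-units."""
    return (bigrams[(u, v)] + alpha) / (contexts[u] + alpha * vocab)
```

The point of the sketch is the interface between stages: the clustering step turns continuous audio representations into a token sequence, so any standard language model can then be trained on it exactly as on text.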