We present the Zero Resource Speech Challenge 2021, which asks participants to learn a language model directly from audio, without any text or labels. The challenge is based on the Libri-light dataset, which provides up to 60k hours of audio from English audiobooks without any associated text. We provide a pipeline baseline system consisting of an encoder based on contrastive predictive coding (CPC), a quantizer ($k$-means) and a standard language model (BERT or LSTM). The metrics evaluate the learned representations at the acoustic (ABX discrimination), lexical (spot-the-word), syntactic (acceptability judgment) and semantic (similarity judgment) levels. We present an overview of the eight submitted systems from four groups and discuss the main results.
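To make the structure of the pipeline baseline concrete, the sketch below illustrates the three stages (acoustic encoder, $k$-means quantizer, language model over discrete units). It is a minimal illustration only, not the challenge's actual baseline code: the CPC encoder is replaced by a fixed random projection, the BERT/LSTM language model by a smoothed bigram model, and all data are synthetic.

```python
# Illustrative sketch of the pipeline baseline:
# encoder -> k-means quantizer -> language model over discrete units.
# The "encoder" and the bigram LM below are stand-ins, NOT the trained
# CPC encoder or the BERT/LSTM models used in the actual baseline.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# --- Stage 1: encoder ----------------------------------------------------
# Each utterance is a sequence of frame-level acoustic features; a trained
# CPC encoder would map these to context-aware representations. Here we
# use a fixed random linear projection purely to show the data flow.
def encode(frames: np.ndarray, proj: np.ndarray) -> np.ndarray:
    return frames @ proj  # (T, feat_dim) -> (T, repr_dim)

feat_dim, repr_dim, n_units = 40, 16, 50
proj = rng.normal(size=(feat_dim, repr_dim))
utterances = [rng.normal(size=(rng.integers(80, 120), feat_dim)) for _ in range(20)]
reps = [encode(u, proj) for u in utterances]

# --- Stage 2: quantizer (k-means) ----------------------------------------
# Cluster frame representations into a small inventory of discrete units,
# turning each utterance into a "pseudo-text" sequence of unit ids.
kmeans = KMeans(n_clusters=n_units, n_init=5, random_state=0)
kmeans.fit(np.concatenate(reps, axis=0))
unit_seqs = [kmeans.predict(r) for r in reps]

# --- Stage 3: language model on discrete units ----------------------------
# The baseline trains BERT or an LSTM on the unit sequences; as a stand-in,
# estimate an add-one-smoothed bigram model and score sequences with it.
counts = np.ones((n_units, n_units))
for seq in unit_seqs:
    for a, b in zip(seq[:-1], seq[1:]):
        counts[a, b] += 1
bigram = counts / counts.sum(axis=1, keepdims=True)

def log_prob(seq: np.ndarray) -> float:
    """Log-probability of a unit sequence under the bigram model."""
    return float(sum(np.log(bigram[a, b]) for a, b in zip(seq[:-1], seq[1:])))

print("example sequence log-probability:", log_prob(unit_seqs[0]))
```

Scores of this kind (probabilities assigned to unit sequences) are what the spot-the-word and acceptability-judgment metrics compare across stimuli, while the ABX and similarity metrics operate on the continuous or quantized representations themselves.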