In this paper, we describe our submissions to the ZeroSpeech 2021 Challenge and the SUPERB benchmark. Our submissions are based on the recently proposed FaST-VGS model, a Transformer-based model that learns to associate raw speech waveforms with semantically related images without using any transcriptions of the speech. Additionally, we introduce a novel extension of this model, FaST-VGS+, which is trained in a multi-task fashion with a masked language modeling objective in addition to the visual grounding objective. On ZeroSpeech 2021, we show that our models perform competitively on the ABX task, outperform all other concurrent submissions on the Syntactic and Semantic tasks, and nearly match the best system on the Lexical task. On the SUPERB benchmark, we show that our models also achieve strong performance, in some cases even outperforming the popular wav2vec 2.0 model.
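To make the multi-task setup concrete, the sketch below combines a masked language modeling loss with an InfoNCE-style contrastive loss over paired speech and image embeddings. This is a minimal PyTorch illustration of the general idea, not the actual FaST-VGS+ training code: the function names, the symmetric contrastive formulation, and the weighting hyperparameter `lambda_mlm` are all assumptions introduced here for illustration.

```python
import torch
import torch.nn.functional as F

def contrastive_grounding_loss(speech_emb, image_emb, temperature=0.07):
    # InfoNCE-style loss over a batch of paired speech/image embeddings;
    # a hypothetical stand-in for the visual grounding objective.
    speech_emb = F.normalize(speech_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = speech_emb @ image_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: speech->image and image->speech retrieval.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def mlm_loss(logits, labels):
    # Masked language modeling loss over discrete speech units; positions
    # that were not masked carry the ignore label -100.
    return F.cross_entropy(logits.view(-1, logits.size(-1)),
                           labels.view(-1), ignore_index=-100)

def multitask_loss(speech_emb, image_emb, mlm_logits, mlm_labels,
                   lambda_mlm=1.0):
    # Multi-task objective: a weighted sum of the two losses, with
    # lambda_mlm treated here as a free hyperparameter (an assumption).
    return (contrastive_grounding_loss(speech_emb, image_emb)
            + lambda_mlm * mlm_loss(mlm_logits, mlm_labels))
```

The key design point this sketch reflects is that both objectives are optimized jointly over the same speech encoder, so the representations are shaped by visual grounding and masked prediction at the same time.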