Time masking has become a de facto standard augmentation technique for speech and audio tasks, including automatic speech recognition (ASR) and audio classification, most notably as part of SpecAugment. In this work, we propose SpliceOut, a simple modification to time masking that makes it computationally more efficient. SpliceOut performs comparably to (and sometimes outperforms) SpecAugment on a wide variety of speech and audio tasks, including ASR for seven different languages with varying amounts of training data, speech translation, and sound and music classification, thus establishing itself as a broadly applicable audio augmentation method. SpliceOut also provides additional gains when used in conjunction with other augmentation techniques. Beyond the fully-supervised setting, we also demonstrate that SpliceOut complements unsupervised representation learning, yielding performance gains in the semi-supervised and self-supervised settings.
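The abstract does not spell out how SpliceOut modifies time masking. The sketch below is an illustration under a stated assumption: that, as the name suggests, the sampled time intervals are removed and the remaining frames are joined, rather than being overwritten with a fill value as in time masking. The function names (`sample_intervals`, `time_mask`, `splice_out`), the spectrogram shape, and the interval parameters are hypothetical and chosen only for this example; they are not taken from the paper.

```python
import numpy as np


def sample_intervals(num_frames, num_intervals, max_width, rng):
    """Draw (start, end) frame intervals; intervals may overlap."""
    intervals = []
    for _ in range(num_intervals):
        # Width in [0, max_width], capped at the sequence length.
        width = min(int(rng.integers(0, max_width + 1)), num_frames)
        start = int(rng.integers(0, num_frames - width + 1))
        intervals.append((start, start + width))
    return intervals


def time_mask(spec, intervals, fill=0.0):
    """Time masking: overwrite the chosen frames, keeping the length."""
    out = spec.copy()
    for start, end in intervals:
        out[:, start:end] = fill
    return out


def splice_out(spec, intervals):
    """Assumed SpliceOut behaviour: drop the chosen frames and join the rest."""
    keep = np.ones(spec.shape[1], dtype=bool)
    for start, end in intervals:
        keep[start:end] = False
    return spec[:, keep]


rng = np.random.default_rng(0)
spec = rng.standard_normal((80, 100))  # (frequency bins, time frames)
intervals = [(10, 20), (50, 65)]       # fixed here so the shapes are easy to check

masked = time_mask(spec, intervals)
spliced = splice_out(spec, intervals)
print(masked.shape, spliced.shape)
```

With the fixed intervals above, the masked spectrogram keeps all 100 frames while the spliced one has 75. Under this assumption, the shorter input is a plausible source of the computational saving the abstract mentions, since downstream layers process fewer frames.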