This paper introduces the contrastive siamese (c-siam) network, an architecture for leveraging unlabeled acoustic data in speech recognition. c-siam is the first network that extracts high-level linguistic information from speech by matching the outputs of two identical transformer encoders. It contains augmented and target branches that are trained by: (1) masking inputs and matching outputs with a contrastive loss, (2) incorporating a stop-gradient operation on the target branch, (3) applying an extra learnable transformation on the augmented branch, and (4) introducing new temporal augment functions to prevent the shortcut learning problem. We use the Libri-light 60k unsupervised data and the LibriSpeech 100hrs/960hrs supervised data to compare c-siam with other best-performing systems. Our experiments show that c-siam provides a 20% relative word error rate improvement over wav2vec baselines. A c-siam network with 450M parameters achieves results competitive with state-of-the-art networks with 600M parameters.
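To make the training scheme concrete, the following is a minimal sketch of one c-siam training step in PyTorch, covering points (1)-(3): a contrastive loss matching the two branches, a stop-gradient on the target branch, and a learnable transform on the augmented branch. The module names (`encoder`, `transform`), the masking function, and the frame-level InfoNCE-style loss are illustrative assumptions, not the paper's implementation; the toy `nn.Linear` modules stand in for the transformer encoders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def contrastive_loss(aug_out, tgt_out, temperature=0.1):
    """Match each augmented-branch frame to the target-branch frame at the
    same time step; other frames in the utterance serve as negatives."""
    aug = F.normalize(aug_out, dim=-1)                  # (T, D)
    tgt = F.normalize(tgt_out, dim=-1)                  # (T, D)
    logits = aug @ tgt.t() / temperature                # (T, T) similarities
    labels = torch.arange(aug.size(0), device=aug.device)
    return F.cross_entropy(logits, labels)

def csiam_step(encoder, transform, x, augment_fn):
    # Augmented branch: masked/augmented input through the shared encoder,
    # followed by the extra learnable transformation.
    aug_out = transform(encoder(augment_fn(x)))
    # Target branch: the clean input through the same encoder, under a
    # stop-gradient so gradients flow only through the augmented branch.
    with torch.no_grad():
        tgt_out = encoder(x)
    return contrastive_loss(aug_out, tgt_out.detach())

# Toy usage with stand-in modules (assumed shapes: one utterance of T frames).
T, D = 50, 64
encoder = nn.Linear(D, D)     # placeholder for the shared transformer encoder
transform = nn.Linear(D, D)   # learnable transform on the augmented branch
mask = lambda x: x * (torch.rand(x.size(0), 1) > 0.15)  # crude input masking
x = torch.randn(T, D)
loss = csiam_step(encoder, transform, x, mask)
loss.backward()
```

Note the asymmetry between the branches: without the stop-gradient and the extra learnable transform, both branches could collapse to a trivial constant output that satisfies the matching objective.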