Self-training has been shown to be helpful in addressing data scarcity for many domains, including vision, speech, and language. Specifically, self-training, or pseudo-labeling, labels unlabeled data with model predictions and adds it to the training pool. In this work, we investigate and use pseudo-labeling for a recently proposed novel setup: joint transcription and translation of speech, which suffers from a lack of sufficient data resources. We show that under such data-deficient circumstances, the unlabeled data can vary significantly in domain from the supervised data, which degrades pseudo-label quality. We investigate two categories of remedies that require no additional supervision and target the domain mismatch: pseudo-label filtering and data augmentation. We show that such pseudo-label analysis and processing yields additional gains on top of the vanilla pseudo-labeling setup, for total improvements of up to 0.6% absolute WER and 2.2 BLEU points.
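To make the pseudo-labeling loop described above concrete, here is a minimal Python sketch of the label-and-filter cycle. The `predict_with_score` interface, the `Example` container, and the confidence threshold are illustrative assumptions for this sketch, not the paper's actual implementation or filtering criterion.

```python
# Minimal sketch of pseudo-labeling with confidence-based filtering.
# Assumed interface: predict_with_score(features) -> (hypothesis, score),
# where score is a model confidence in [0, 1]. Hypothetical, for illustration.

from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Example:
    features: str        # e.g. a path to an audio utterance
    label: str = ""      # transcript/translation; empty if unlabeled


def pseudo_label(
    unlabeled: List[Example],
    predict_with_score: Callable[[str], Tuple[str, float]],
    threshold: float = 0.9,
) -> List[Example]:
    """Label each unlabeled example with the model's prediction and keep
    only those whose confidence clears the threshold (the filtering step)."""
    kept = []
    for ex in unlabeled:
        hypothesis, score = predict_with_score(ex.features)
        if score >= threshold:  # drop low-confidence pseudo-labels
            kept.append(Example(ex.features, hypothesis))
    return kept


# Usage: the filtered pseudo-labeled pool is mixed back into the
# supervised data for another round of training, e.g.
#   train_pool = supervised_data + pseudo_label(unlabeled_data, model_fn)
```

In the in-domain-mismatch setting the abstract describes, the filtering step is where domain-sensitive criteria would be applied; data augmentation would instead transform the supervised or pseudo-labeled examples before they enter the training pool.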