This paper proposes a novel technique to obtain better downstream ASR performance from a joint encoder-decoder self-supervised model trained on speech pooled from two different channels (narrowband and wideband). The joint encoder-decoder self-supervised model extends HuBERT with a Transformer decoder. HuBERT clusters acoustic features and predicts the cluster ID of every input frame. With simple pooling, which serves as our baseline, there is no way to identify the channel of an utterance. To incorporate channel information, we propose non-overlapping cluster IDs for speech from the different channels. Our method gives a relative improvement of ~5% over the baseline model built with simple pooling of data.
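To make the non-overlapping cluster-ID idea concrete, here is a minimal sketch in Python/NumPy. The names (`NUM_CLUSTERS`, `assign_cluster_ids`) and the codebook size are illustrative assumptions, not taken from the paper; the only point is that wideband labels are offset by the per-channel codebook size so the two channels occupy disjoint target ranges.

```python
# Minimal sketch of channel-disjoint cluster IDs (illustrative, not the
# paper's implementation). NUM_CLUSTERS is an assumed per-channel
# k-means codebook size.
import numpy as np

NUM_CLUSTERS = 500  # assumed codebook size for each channel

def assign_cluster_ids(frame_labels: np.ndarray, channel: str) -> np.ndarray:
    """Map per-frame k-means labels into a channel-disjoint ID space.

    Narrowband frames keep IDs in [0, NUM_CLUSTERS); wideband frames are
    shifted into [NUM_CLUSTERS, 2 * NUM_CLUSTERS), so HuBERT-style
    prediction targets never collide across channels.
    """
    offset = 0 if channel == "narrowband" else NUM_CLUSTERS
    return frame_labels + offset

# Example: the same raw cluster label yields two distinct targets.
nb_ids = assign_cluster_ids(np.array([42, 7, 13]), "narrowband")  # [42, 7, 13]
wb_ids = assign_cluster_ids(np.array([42, 7, 13]), "wideband")    # [542, 507, 513]
```

Under this scheme the model's output layer simply grows to cover both ID ranges, and the channel identity becomes recoverable from the target itself, which simple pooling cannot provide.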