With the advent of general-purpose speech representations from large-scale self-supervised models, applying a single model to multiple downstream tasks is becoming the de facto approach. However, the pooling problem remains: the length of speech representations is inherently variable. Naive average pooling is often used even though it ignores characteristics of speech such as phonemes of varying lengths. Hence, we design a novel pooling method that squashes acoustically similar representations via vector quantization and, unlike attention-based pooling, requires no additional training. Further, we evaluate a range of unsupervised pooling methods on various self-supervised models. We gather diverse methods scattered across the speech and text literature and evaluate them on several tasks: keyword spotting, speaker identification, intent classification, and emotion recognition. Finally, we quantitatively and qualitatively analyze our method, comparing it with supervised pooling methods.
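To make the idea concrete, the sketch below illustrates one plausible reading of training-free, quantization-based pooling: frame-level representations are grouped by acoustic similarity (here with k-means as the vector quantizer, an assumption since the abstract does not specify the quantizer) and each group is averaged before a final mean, so long repetitive segments do not dominate the utterance vector as they do under naive average pooling. This is a minimal illustration, not the paper's exact algorithm; the function name `vq_pool` and the code size are hypothetical.

```python
# Minimal sketch (not the paper's exact algorithm): training-free pooling that
# groups acoustically similar frame representations with a vector quantizer
# (k-means here) and averages within each group before a final mean.
import numpy as np
from sklearn.cluster import KMeans

def vq_pool(frames: np.ndarray, num_codes: int = 8) -> np.ndarray:
    """Pool a (T, D) sequence of frame representations into a single (D,) vector.

    Frames assigned to the same code are averaged first, so stretched or
    repeated phonemes contribute once per code rather than once per frame.
    """
    num_codes = min(num_codes, len(frames))
    labels = KMeans(n_clusters=num_codes, n_init=10).fit_predict(frames)
    centroids = np.stack(
        [frames[labels == c].mean(axis=0) for c in range(num_codes)]
    )
    return centroids.mean(axis=0)

# Example: 120 frames of 768-dim features from a self-supervised encoder.
pooled = vq_pool(np.random.randn(120, 768))
print(pooled.shape)  # (768,)
```

Because the quantizer is fit per utterance at inference time, no task-specific parameters are learned, which is the contrast drawn with attention-based pooling above.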