Self-supervised models have had great success in learning speech representations that generalize to various downstream tasks. HuBERT, in particular, achieves strong performance while being relatively simple to train compared to other approaches. However, the original experimental setting is computationally demanding, hindering the reproducibility of the models. It is also unclear why certain design decisions were made, such as the ad-hoc loss function, and whether these decisions affect the learned representations. We propose MelHuBERT, a simplified version of HuBERT that takes Mel spectrograms as input, significantly reducing computation and memory consumption. We study several aspects of training, including the loss function, multi-stage training, and streaming options. Our result is an efficient yet performant model that can be trained on a single GPU.
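As a minimal sketch of the kind of input feature the abstract refers to, the snippet below computes a log Mel spectrogram from a waveform with torchaudio. The specific settings (40 Mel bins, 25 ms window, 10 ms hop) and the file name are illustrative assumptions, not necessarily the paper's exact configuration.

```python
# Sketch: log Mel spectrogram features of the kind MelHuBERT consumes,
# replacing HuBERT's raw-waveform front end. Parameters are assumptions.
import torch
import torchaudio

waveform, sample_rate = torchaudio.load("utterance.wav")  # hypothetical 16 kHz file

mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate,
    n_fft=400,        # 25 ms analysis window at 16 kHz
    hop_length=160,   # 10 ms frame shift at 16 kHz
    n_mels=40,        # assumed number of Mel bins
)

mel = mel_transform(waveform)             # (channels, n_mels, time)
log_mel = torch.log(mel + 1e-6)           # log compression for numerical stability
frames = log_mel.squeeze(0).transpose(0, 1)  # (time, n_mels) frame sequence
print(frames.shape)
```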