Self-supervised learning (SSL) is regarded as a highly promising approach that achieves high performance on several downstream speech tasks. Because SSL models generally have so many parameters that training and inference demand substantial memory and computation, it is desirable to produce compact SSL models without significant performance degradation by applying compression methods such as knowledge distillation (KD). Although KD can shrink the depth and/or width of an SSL model, there has been little research on how varying the depth and width affects the internal representation of the small-footprint model. This paper provides an empirical study that addresses this question. We investigate performance on SUPERB while varying the model structure and the KD method with the number of parameters held constant; this allows us to analyze how the representation induced by each architecture contributes to performance. Experiments demonstrate that a certain depth is essential for solving content-oriented tasks (e.g., automatic speech recognition) accurately, whereas a certain width is necessary for achieving high performance on several speaker-oriented tasks (e.g., speaker identification). Based on these observations, we identify, for SUPERB, a more compressed model that performs better than those of previous studies.
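The idea of trading depth against width at a fixed parameter count can be illustrated with a rough sketch. It assumes a standard Transformer encoder layer (four attention projections plus a two-layer feed-forward block with a 4x expansion) and ignores biases, layer normalization, and any convolutional front end. The function names and the example sizes are hypothetical and are not taken from the paper.

```python
import math


def encoder_params(depth: int, width: int, ffn_mult: int = 4) -> int:
    """Approximate weight count of a Transformer encoder stack.

    Assumption: each layer holds 4 * width^2 attention weights (Q, K, V,
    output) and 2 * ffn_mult * width^2 feed-forward weights; biases and
    LayerNorm parameters are ignored.
    """
    attention = 4 * width * width
    feed_forward = 2 * ffn_mult * width * width
    return depth * (attention + feed_forward)


def width_for_budget(budget: int, depth: int, ffn_mult: int = 4) -> int:
    """Largest width whose encoder stack stays within `budget` weights."""
    per_layer_coeff = 4 + 2 * ffn_mult  # weights per layer = coeff * width^2
    return math.isqrt(budget // (depth * per_layer_coeff))


if __name__ == "__main__":
    # Hypothetical budget: a deep, narrow student with 12 layers of width 384.
    budget = encoder_params(depth=12, width=384)
    for depth in (12, 6, 3):
        width = width_for_budget(budget, depth)
        print(depth, width, encoder_params(depth, width))
```

Under these assumptions the count scales as depth times width squared, so cutting depth by a factor of four allows roughly double the width for the same budget. In practice the width would also need to be divisible by the number of attention heads.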