Many application studies rely on audio DNN models pre-trained on a large-scale dataset as essential feature extractors, typically extracting features from the last layers. In this study, we focus on our finding that the middle-layer features of existing supervised pre-trained models are more effective than the late-layer features for some tasks. We propose a simple approach to compose features effective for general-purpose applications, consisting of two steps: (1) calculating feature vectors along the time frame from middle/late-layer outputs, and (2) fusing them. This approach improves the utility of frequency and channel information in downstream processes, and combines the effectiveness of middle- and late-layer features on different tasks. As a result, the feature vectors become effective for general purposes. In experiments using VGGish, PANNs' CNN14, and AST on nine downstream tasks, we first show that each layer output of these models serves different tasks. We then demonstrate that the proposed approach significantly improves their performance, bringing it to a level comparable to the state of the art. In particular, performance on the non-semantic speech (NOSS) tasks improves greatly, most notably on Speech Commands V2 with VGGish, where accuracy improves by 77.1 points (from 14.3% to 91.4%).
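The two steps above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: it assumes 4-D layer outputs shaped (batch, channel, time, freq), folds frequency and channel into the feature axis, averages along the time frame to get one vector per layer, and fuses middle- and late-layer vectors by concatenation. All names and shapes are illustrative assumptions.

```python
import numpy as np

def layer_feature(x):
    """Turn one layer's output into a feature vector per example.

    x: hypothetical layer output of shape (batch, channel, time, freq).
    Frequency and channel are kept as features (rather than pooled away),
    then the time frames are averaged."""
    b, c, t, f = x.shape
    x = x.transpose(0, 2, 1, 3).reshape(b, t, c * f)  # (batch, time, channel*freq)
    return x.mean(axis=1)                             # (batch, channel*freq)

def fuse(mid, late):
    """Fuse middle- and late-layer feature vectors by concatenation."""
    return np.concatenate([layer_feature(mid), layer_feature(late)], axis=-1)

# Illustrative shapes only (not taken from any specific model):
mid = np.random.randn(2, 64, 100, 8)    # middle layer: more frequency bins, fewer channels
late = np.random.randn(2, 512, 100, 2)  # late layer: more channels, fewer frequency bins
feat = fuse(mid, late)
print(feat.shape)  # (2, 1536): 64*8 + 512*2 features per example
```

The concatenated vector carries both the frequency-rich middle-layer information and the semantically abstract late-layer information, which is the property the abstract attributes to the fused features.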