We propose in this work a multi-view learning approach for audio and music classification. Considering four typical low-level representations (i.e., different views) commonly used for audio and music recognition tasks, the proposed multi-view network consists of four subnetworks, each handling one input type. The embeddings learned by the subnetworks are then concatenated to form a multi-view embedding for classification, similar to a simple concatenation network. However, apart from the joint classification branch, the network also maintains four classification branches on the single-view embeddings of the subnetworks. A novel method is then proposed to track the learning behavior of these classification branches and to adapt their weights so as to proportionally blend their gradients during network training. The weights are adapted such that learning on a branch that is generalizing well is encouraged, whereas learning on a branch that is overfitting is slowed down. Experiments on three different audio and music classification tasks show that the proposed multi-view network not only outperforms the single-view baselines but is also superior to multi-view baselines based on concatenation and late fusion.
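The branch-weight adaptation described above can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: it assumes each branch's generalization behavior is summarized by the gap between its validation and training losses, and that weights decay exponentially with that gap (the function name, the exponential form, and the `temperature` parameter are all illustrative assumptions).

```python
import math

def adapt_branch_weights(train_losses, val_losses, temperature=1.0):
    """Compute per-branch blending weights from generalization gaps.

    A branch whose validation loss tracks its training loss (small
    gap, generalizing well) receives a larger weight; a branch whose
    validation loss diverges (large gap, overfitting) is down-weighted,
    slowing its learning when gradients are blended proportionally.
    The exponential decay used here is an illustrative choice.
    """
    gaps = [max(v - t, 0.0) for t, v in zip(train_losses, val_losses)]
    scores = [math.exp(-g / temperature) for g in gaps]
    total = sum(scores)
    return [s / total for s in scores]

# Hypothetical losses for four single-view branches (one per view):
train = [0.30, 0.25, 0.40, 0.35]
val   = [0.32, 0.60, 0.42, 0.90]  # branches 2 and 4 are overfitting
weights = adapt_branch_weights(train, val)
# The blended training objective would then scale each branch's
# loss (and hence its gradient contribution) by its weight.
blended_loss = sum(w * l for w, l in zip(weights, train))
```

In this sketch the well-generalizing branches (small gap) dominate the blend, while the overfitting branches contribute less, matching the intended behavior of encouraging generalizing branches and slowing overfitting ones.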