In this article, we adapted five recent SSL methods to the task of audio classification. The first two methods, namely Deep Co-Training (DCT) and Mean Teacher (MT), involve two collaborative neural networks. The three other algorithms, called MixMatch (MM), ReMixMatch (RMM), and FixMatch (FM), are single-model methods that rely primarily on data augmentation strategies. Using the Wide-ResNet-28-2 architecture in all our experiments, with 10% of the data labeled and the remaining 90% used as unlabeled data for training, we first compare the error rates of the five methods on three standard benchmark audio datasets: Environmental Sound Classification (ESC-10), UrbanSound8K (UBS8K), and Google Speech Commands (GSC). In all but one case, MM, RMM, and FM significantly outperformed MT and DCT, with MM and RMM being the best methods in most experiments. On UBS8K and GSC, MM achieved error rates (ER) of 18.02% and 3.25%, respectively, outperforming models trained with 100% of the available labeled data, which reached 23.29% and 4.94%, respectively. RMM achieved the best results on ESC-10 (12.00% ER), followed by FM, which reached 13.33%. Second, we explored adding the mixup augmentation, used in MM and RMM, to DCT, MT, and FM. In almost all cases, mixup brought consistent gains. For instance, on GSC, FM reached 4.44% and 3.31% ER without and with mixup, respectively. Our PyTorch code will be made available upon paper acceptance at https://github.com/Labbeti/SSLH.
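The mixup augmentation mentioned above combines pairs of examples and their labels through convex interpolation, with the mixing weight drawn from a Beta distribution. A minimal NumPy sketch, assuming one-hot label vectors and an illustrative `alpha` value (the function name and signature are ours, not the paper's):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.4):
    """Mix two examples and their one-hot labels with a Beta-sampled weight.

    lam ~ Beta(alpha, alpha) lies in (0, 1), so the mixed example is a
    convex combination of the two inputs, and likewise for the labels.
    """
    lam = np.random.beta(alpha, alpha)
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2
    return x, y
```

Because the mixed label is a convex combination of two one-hot vectors, it still sums to one and can be used directly as a soft target with a cross-entropy loss.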