Convolutional neural networks (CNNs) perform well on low-complexity classification tasks such as acoustic scene classification (ASC). However, there are few studies on the relationship between the length of the target speech and the size of the convolution kernels. In this paper, we combine a Selective Kernel Network with temporal convolution (TC-SKNet) to adjust the receptive field of the convolution kernels, handling variable-length target speech while keeping complexity low. GridMask is a data augmentation strategy that masks part of the raw data or feature map; like dropout, it improves the generalization of the model. In our experiments, the performance gain brought by GridMask is stronger than that of spectrum augmentation on ASC. Finally, we adopt AutoML to search for the best TC-SKNet structure and GridMask hyperparameters to improve classification performance. As a result, TC-SKNet reaches a peak accuracy of 59.87%, on par with the SOTA, while using only 20.9 K parameters.
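The GridMask augmentation described above can be sketched as follows: a periodic grid of square patches is zeroed out on the input feature map. This is a minimal illustration of the general GridMask idea, not the paper's implementation; the parameter names (`d` for the grid period, `ratio` for the masked fraction of each period) and their default values are assumptions for illustration.

```python
import numpy as np

def grid_mask(spec, d=16, ratio=0.4, rng=None):
    """Zero out a periodic grid of square patches on a (freq, time) feature map.

    d: grid period in bins; ratio: fraction of each period that is masked.
    (Hyperparameter names and defaults are illustrative, not from the paper.)
    """
    rng = np.random.default_rng() if rng is None else rng
    l = max(1, int(d * ratio))            # side length of each masked square
    # Random offsets so the grid position varies between training examples.
    ox, oy = rng.integers(0, d, size=2)
    mask = np.ones_like(spec)
    # Start at -1 so squares partially entering from the top/left are included.
    for i in range(-1, spec.shape[0] // d + 1):
        for j in range(-1, spec.shape[1] // d + 1):
            x0, y0 = i * d + ox, j * d + oy
            mask[max(x0, 0):x0 + l, max(y0, 0):y0 + l] = 0.0
    return spec * mask
```

Like dropout, the masking is applied only during training; at inference the feature map is passed through unchanged.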