ESResNe(X)t-fbsp: Learning Robust Time-Frequency Transformation of Audio

Environmental Sound Classification (ESC) is a rapidly evolving field that has recently demonstrated the advantages of applying visual-domain techniques to audio-related tasks. Previous studies indicate that domain-specific modification of cross-domain approaches shows promise in pushing the whole area of ESC forward. In this paper, we present a new time-frequency transformation layer based on complex frequency B-spline (fbsp) wavelets. When used with a high-performance audio classification model, the proposed fbsp-layer provides an accuracy improvement over the previously used Short-Time Fourier Transform (STFT) on standard datasets. We also investigate the influence of different pre-training strategies, including the joint use of two large-scale datasets for weight initialization: ImageNet and AudioSet. Our proposed model outperforms other approaches, achieving accuracies of 95.20 % on the ESC-50 and 89.14 % on the UrbanSound8K datasets. Additionally, we assess how the proposed layer affects model robustness against additive white Gaussian noise and against a reduction of the effective sample rate, and demonstrate that the fbsp-layer improves the model's ability to withstand signal perturbations in comparison to STFT-based training. For the sake of reproducibility, our code is made available.

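To make two of the abstract's ingredients concrete, the following is a minimal sketch, not the authors' implementation. It assumes the common textbook form of the complex frequency B-spline wavelet, psi(t) = sqrt(f_b) * sinc(f_b * t / m)^m * exp(2*pi*i*f_c*t), with a normalized sinc; the paper's learnable layer may parameterize the wavelet differently. The function names `fbsp_wavelet` and `add_awgn`, their parameters, and the default values are all hypothetical and chosen for illustration only. The noise helper shows one conventional way to inject additive white Gaussian noise at a target signal-to-noise ratio, which may differ from the perturbation protocol used in the paper.

```python
import numpy as np


def fbsp_wavelet(t, m=2, fb=1.0, fc=1.0):
    """Complex frequency B-spline wavelet in its common textbook form.

    psi(t) = sqrt(fb) * sinc(fb * t / m)**m * exp(2j * pi * fc * t)

    m  : integer order (>= 1), assumed here
    fb : bandwidth parameter
    fc : center frequency
    np.sinc is the normalized sinc, sin(pi x) / (pi x).
    """
    t = np.asarray(t, dtype=float)
    envelope = np.sqrt(fb) * np.sinc(fb * t / m) ** m
    carrier = np.exp(2j * np.pi * fc * t)
    return envelope * carrier


def add_awgn(signal, snr_db, rng=None):
    """Add white Gaussian noise so that the result has roughly `snr_db` dB SNR."""
    rng = np.random.default_rng() if rng is None else rng
    signal = np.asarray(signal, dtype=float)
    signal_power = np.mean(signal ** 2)
    # SNR (dB) = 10 * log10(signal_power / noise_power)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise


if __name__ == "__main__":
    # Sample the wavelet on a short grid and perturb a 440 Hz test tone.
    t = np.linspace(-4.0, 4.0, 9)
    print(np.abs(fbsp_wavelet(t, m=2, fb=1.0, fc=1.0)))

    time = np.arange(16000) / 16000.0
    tone = np.sin(2 * np.pi * 440.0 * time)
    noisy = add_awgn(tone, snr_db=10.0, rng=np.random.default_rng(0))
    print(noisy.shape)
```

In a learnable layer, `m`, `fb`, and `fc` would be trainable parameters rather than fixed arguments; here they are fixed only to keep the sketch short.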