From Sound Representation to Model Robustness

In this paper, we investigate the impact of different standard environmental sound representations (spectrograms) on the recognition performance and adversarial robustness of a victim residual convolutional neural network. Averaged over various experiments on three benchmark environmental sound datasets, we found that the ResNet-18 model outperforms other deep learning architectures such as GoogLeNet and AlexNet in terms of both classification accuracy and the number of training parameters. Therefore, we set this model as our front-end classifier for subsequent investigations. Herein, we measure the impact on our front-end model of the different settings required for generating more informative mel-frequency cepstral coefficient (MFCC), short-time Fourier transform (STFT), and discrete wavelet transform (DWT) representations. This measurement involves comparing classification performance against adversarial robustness. Weighing the average budget allocated by the adversary against the cost of the attack, we demonstrate an inverse relationship between recognition accuracy and model robustness against six attack algorithms. Moreover, our experimental results show that while the ResNet-18 model trained on DWT spectrograms achieves the highest recognition accuracy, attacking this model is relatively more costly for the adversary than attacking models trained on the other 2D representations.
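To make the notion of "settings required for generating" a representation concrete, the following is a minimal sketch of a log-magnitude STFT spectrogram in NumPy. It is not the authors' pipeline: the function name `stft_spectrogram`, the parameters `frame_len` and `hop`, the Hann window, and the synthetic 1 kHz test tone are all assumptions chosen for illustration. Changing `frame_len` and `hop` changes the time-frequency resolution of the 2D array that a classifier such as ResNet-18 would receive.

```python
import numpy as np


def stft_spectrogram(signal, frame_len=256, hop=128):
    """Return a log-magnitude STFT spectrogram of shape (frame_len // 2 + 1, n_frames).

    frame_len and hop are hypothetical settings, not values taken from the paper.
    """
    window = np.hanning(frame_len)
    # Number of full frames that fit in the signal (no padding is applied).
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the signal into overlapping frames and apply the window to each.
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    # One-sided FFT of each frame; keep magnitudes only.
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    # Convert to decibels; the small constant avoids log(0).
    return (20.0 * np.log10(magnitude + 1e-10)).T


# Synthetic one-second 1 kHz tone sampled at 8 kHz (illustrative input only).
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)

spec = stft_spectrogram(tone)
# Frequency bin with the highest average energy across all frames.
peak_bin = int(np.argmax(spec.mean(axis=1)))
peak_hz = peak_bin * sample_rate / 256

print(spec.shape, peak_bin, peak_hz)
```

With these assumed settings, each frequency bin spans 8000 / 256 = 31.25 Hz, so the 1 kHz tone should concentrate its energy in bin 32. A shorter `frame_len` would give finer time resolution at the expense of coarser frequency bins, which is the kind of trade-off such representation settings control.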
