Convolutional neural networks (CNNs) are widely used across machine learning domains. In image processing, features are obtained by applying 2D convolution over all spatial dimensions of the input. For audio, however, frequency-domain inputs such as the Mel-spectrogram have distinct characteristics along the frequency dimension, so a 2D convolution layer needs a way to treat that dimension differently. In this work, we introduce SubSpectral Normalization (SSN), which splits the input frequency dimension into several groups (sub-bands) and normalizes each group separately. SSN can also apply an affine transformation to each group. Our method removes the inter-frequency deflection while letting the network learn frequency-aware characteristics. In experiments with audio data, we observed that SSN efficiently improves the network's performance.
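The sub-band normalization described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes an input laid out as (batch, channels, frequency, time), contiguous sub-bands of equal width, batch-style statistics computed per (channel, sub-band) pair, and no running averages for inference. The function name `subspectral_norm` and the parameters `num_groups`, `gamma`, `beta`, and `eps` are hypothetical names chosen for illustration.

```python
import numpy as np


def subspectral_norm(x, num_groups, gamma=None, beta=None, eps=1e-5):
    """Normalize each frequency sub-band separately (illustrative sketch).

    x: array of shape (batch, channels, freq, time) -- an assumed layout.
    gamma, beta: optional per-(channel, sub-band) affine parameters,
                 each of shape (channels, num_groups).
    """
    n, c, f, t = x.shape
    if f % num_groups != 0:
        raise ValueError("number of frequency bins must be divisible by num_groups")

    # Split the frequency axis into contiguous sub-bands:
    # (N, C, F, T) -> (N, C, S, F/S, T)
    xg = x.reshape(n, c, num_groups, f // num_groups, t)

    # One mean/variance per (channel, sub-band), taken over the batch,
    # the bins inside the sub-band, and time.
    mean = xg.mean(axis=(0, 3, 4), keepdims=True)
    var = xg.var(axis=(0, 3, 4), keepdims=True)
    yg = (xg - mean) / np.sqrt(var + eps)

    # Optional affine transformation applied per sub-band.
    if gamma is not None:
        yg = yg * gamma.reshape(1, c, num_groups, 1, 1)
    if beta is not None:
        yg = yg + beta.reshape(1, c, num_groups, 1, 1)

    # Restore the original layout.
    return yg.reshape(n, c, f, t)


# Usage: a toy "spectrogram" whose upper band has a different scale and offset.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4, 40, 10))
x[:, :, 20:, :] = x[:, :, 20:, :] * 5.0 + 3.0
y = subspectral_norm(x, num_groups=2)
```

With `num_groups=2`, the lower and upper halves of the frequency axis are each brought to roughly zero mean and unit variance per channel, whereas a single normalization over the whole frequency axis would leave the two bands with mismatched statistics.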