Recent studies have called into question the commonly assumed shift invariance of convolutional networks, showing that small shifts in the input can substantially change the output predictions. In this paper, we analyze the benefits of addressing the lack of shift invariance in CNN-based sound event classification. Specifically, we evaluate two pooling methods that improve shift invariance in CNNs, one based on low-pass filtering and the other on adaptive sampling of incoming feature maps. Both are implemented as small architectural modifications inserted into the pooling layers of CNNs. We evaluate the effect of these changes on the FSD50K dataset using models of different capacity and in the presence of strong regularization. We show that the modifications consistently improve sound event classification in all cases considered. We also demonstrate empirically that the proposed pooling methods increase shift invariance in the network, making it more robust to time/frequency shifts in input spectrograms. This is achieved by adding a negligible number of trainable parameters, which makes these methods an appealing alternative to conventional pooling layers. The outcome is a new state-of-the-art mAP of 0.541 on the FSD50K classification benchmark.
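To make the low-pass filtering idea concrete, the following is a minimal one-dimensional sketch in NumPy, not the implementation evaluated in the paper. It compares plain strided max pooling with a variant that takes the max densely (stride 1), blurs the result, and only then subsamples. The function names, the binomial kernel `(1, 2, 1)`, the edge padding, and the toy signal are all assumptions chosen for illustration; the adaptive sampling method is not shown.

```python
import numpy as np


def max_pool_1d(x, k=2, stride=2):
    """Conventional max pooling: windowed max taken directly at the stride."""
    n = (len(x) - k) // stride + 1
    return np.array([x[i * stride : i * stride + k].max() for i in range(n)])


def blur_pool_1d(x, k=2, stride=2, kernel=(1.0, 2.0, 1.0)):
    """Hypothetical anti-aliased pooling: dense max, low-pass filter, subsample."""
    # Step 1: max over every window position (stride 1), so no sample is skipped.
    dense = np.array([x[i : i + k].max() for i in range(len(x) - k + 1)])
    # Step 2: low-pass filter with a normalized kernel (assumed binomial filter).
    h = np.asarray(kernel, dtype=float)
    h = h / h.sum()
    pad = len(h) // 2
    padded = np.pad(dense, pad, mode="edge")  # edge padding is an assumption
    blurred = np.convolve(padded, h, mode="valid")
    # Step 3: subsample the smoothed signal.
    return blurred[::stride]


def mean_abs_gap(pool_fn, x, shift=1):
    """Mean absolute change in the pooled output when the input is shifted."""
    return float(np.mean(np.abs(pool_fn(x) - pool_fn(np.roll(x, shift)))))


# Toy periodic signal, circularly shifted by one sample.
signal = np.array([0, 0, 1, 1, 0, 0, 1, 1], dtype=float)
gap_max = mean_abs_gap(max_pool_1d, signal)
gap_blur = mean_abs_gap(blur_pool_1d, signal)
print(f"max pool gap: {gap_max:.4f}, blurred pool gap: {gap_blur:.4f}")
```

On this toy signal the blurred variant changes less under a one-sample shift than plain max pooling does, which is the behavior the low-pass filtering approach aims for. In a CNN the same idea would be applied per channel along the time and frequency axes of the feature map.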