End-to-end learning models using raw waveforms as input have shown superior performance in many audio recognition tasks. However, most model architectures are based on convolutional neural networks (CNNs), which were mainly developed for visual recognition tasks. In this paper, we propose an extension of squeeze-and-excitation networks (SENets) that adds temporal feedback control from top-layer features to channel-wise feature activations in lower layers using a recurrent module. This is analogous to the adaptive gain control mechanism of outer hair cells in the human auditory system. We apply the proposed model to speech command recognition and show that it slightly outperforms SENets and other CNN-based models. We also investigate the details of the performance improvement by conducting failure analysis and visualizing the channel-wise feature scaling induced by the temporal feedback.
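The core mechanism can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function name, weight shapes, and the `feedback_gain` argument are illustrative assumptions, and the recurrent module that would actually produce the feedback signal from top-layer features is elided. It shows only how a squeeze-and-excitation gate rescales channels, and where a per-channel feedback multiplier would enter.

```python
import numpy as np

def squeeze_excite(feature_map, w1, w2, feedback_gain=None):
    """SE-style channel recalibration with optional temporal feedback.

    feature_map: (channels, time) activations of one conv layer.
    w1, w2: bottleneck weights of the excitation MLP (illustrative shapes).
    feedback_gain: hypothetical per-channel multipliers that a recurrent
        module would compute from top-layer features at the previous time
        step; when None this reduces to a plain SE block.
    """
    # Squeeze: global average pooling over time -> (channels,)
    z = feature_map.mean(axis=1)

    # Excitation: bottleneck MLP with a sigmoid gate in (0, 1)
    relu = lambda x: np.maximum(x, 0.0)
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    s = sigmoid(w2 @ relu(w1 @ z))

    if feedback_gain is not None:
        # Temporal feedback: modulate the channel gates with a signal
        # derived from higher-layer features, loosely analogous to
        # outer-hair-cell gain control in the auditory system.
        s = s * feedback_gain

    # Scale: reweight each channel of the feature map
    return feature_map * s[:, None]
```

Because the gate is a sigmoid, each channel is attenuated rather than amplified; the feedback term then adjusts that attenuation over time. For example, with 4 channels and a bottleneck of 2, `squeeze_excite(x, w1, w2)` returns an array of the same `(4, T)` shape with every channel scaled by a value in (0, 1).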