The audio spectrogram is a time-frequency representation that has been widely used for audio classification. The temporal resolution of a spectrogram depends on the hop size. Previous works generally assume the hop size should be a constant value, such as ten milliseconds. However, a fixed hop size, and hence a fixed resolution, is not always optimal for different types of sound. This paper proposes a novel method, DiffRes, that enables differentiable temporal resolution learning to improve the performance of audio classification models. Given a spectrogram calculated with a fixed hop size, DiffRes merges non-essential time frames while preserving important frames. DiffRes acts as a "drop-in" module between an audio spectrogram and a classifier, and can be optimized end-to-end. We evaluate DiffRes on mel-spectrograms with state-of-the-art classifier backbones on five different audio classification subtasks. Compared with using a fixed-resolution mel-spectrogram, the DiffRes-based method achieves the same or better classification accuracy with at least 25% fewer temporal dimensions at the feature level, which also reduces the computational cost. Starting from a high-temporal-resolution spectrogram, such as one with a one-millisecond hop size, we show that DiffRes can improve classification accuracy at the same computational complexity.
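To make the "drop-in module" idea concrete, below is a minimal sketch, assuming PyTorch, of a differentiable frame-merging layer placed between a mel-spectrogram and a classifier. This is not the authors' implementation: the module name `DiffResSketch`, the per-frame scoring convolution, and the cumulative-importance soft assignment are illustrative assumptions standing in for the actual DiffRes algorithm described in the paper.

```python
# Illustrative sketch of differentiable frame merging (not the official DiffRes).
import torch
import torch.nn as nn

class DiffResSketch(nn.Module):
    def __init__(self, n_mels: int, reduction: float = 0.25):
        super().__init__()
        self.reduction = reduction
        # A 1-D convolution over time predicts a per-frame importance score.
        self.score = nn.Conv1d(n_mels, 1, kernel_size=5, padding=2)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, time, n_mels), e.g. a mel-spectrogram with a small hop size.
        b, t, _ = spec.shape
        t_out = int(t * (1.0 - self.reduction))  # target temporal length
        s = torch.sigmoid(self.score(spec.transpose(1, 2))).squeeze(1)  # (b, t)
        # Cumulative importance maps each input frame to a fractional output slot;
        # low-score frames advance the cursor little, so neighbours get merged.
        pos = torch.cumsum(s, dim=1)
        pos = pos / pos[:, -1:].clamp_min(1e-6) * (t_out - 1)  # in [0, t_out - 1]
        # Soft (triangular) assignment of every input frame to the output frames;
        # gradients flow through the weights back into the scoring convolution.
        grid = torch.arange(t_out, device=spec.device).view(1, t_out, 1)
        w = torch.relu(1.0 - (pos.unsqueeze(1) - grid).abs())  # (b, t_out, t)
        w = w / w.sum(dim=2, keepdim=True).clamp_min(1e-6)
        return torch.bmm(w, spec)  # (b, t_out, n_mels)

# Drop-in usage between the spectrogram and an arbitrary classifier backbone:
spec = torch.randn(4, 1000, 64)          # e.g. 10 s of audio at 10 ms hop, 64 mel bins
merged = DiffResSketch(n_mels=64)(spec)  # (4, 750, 64): 25% fewer time frames
```

The merged output keeps the feature dimension unchanged, so any backbone that consumes a mel-spectrogram can consume it directly, which is what allows the module to be trained end-to-end with the classifier's loss.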