While the log-amplitude mel-spectrogram has been widely used as the feature representation in deep-learning-based speech processing, another aspect of the speech spectrum, the phase information, has recently been shown to be effective for tasks such as speech enhancement and source separation. In this study, we extensively investigated the effectiveness of including the phase information of signals for eight audio classification tasks. We constructed a learnable front-end that can compute the phase and its derivatives based on a time-frequency representation with a mel-like frequency axis. Experimental results showed significant performance improvement for musical pitch detection, musical instrument detection, language identification, speaker identification, and birdsong detection. On the other hand, overfitting to the recording condition was observed for some tasks when the instantaneous frequency was used. The results imply that the relationship between the phase values of adjacent elements is more important than the phase itself in audio classification.
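To make "the phase and its derivatives" concrete, the following is a minimal NumPy sketch of how such features might be computed from a time-frequency representation. It is not the learnable front-end described above: it assumes a fixed Hann-windowed STFT with a linear (not mel-like) frequency axis, and the function names (`stft`, `wrap`, `phase_features`) and the parameters `n_fft`, `hop`, and `sr` are hypothetical choices made for illustration only. The time derivative of the phase is reported as an instantaneous-frequency estimate, and the frequency derivative as an unnormalized group delay.

```python
import numpy as np


def stft(x, n_fft=512, hop=128):
    """Hann-windowed STFT; returns a complex array of shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)


def wrap(p):
    """Wrap phase values to the interval [-pi, pi)."""
    return (p + np.pi) % (2 * np.pi) - np.pi


def phase_features(x, sr=16000, n_fft=512, hop=128):
    """Return (phase, instantaneous frequency in Hz, group delay in rad/bin).

    Illustrative only: a fixed STFT stands in for the learnable,
    mel-like representation discussed in the text.
    """
    spec = stft(x, n_fft, hop)
    phase = np.angle(spec)

    # Time derivative: remove the phase advance expected at each bin centre
    # over one hop, wrap the remainder, and convert it to a frequency offset.
    k = np.arange(spec.shape[1])
    expected_advance = 2 * np.pi * k * hop / n_fft
    deviation = wrap(np.diff(phase, axis=0) - expected_advance)
    inst_freq_hz = k * sr / n_fft + deviation * sr / (2 * np.pi * hop)

    # Frequency derivative: negative wrapped phase difference between
    # adjacent bins, measured relative to the start of each frame.
    group_delay = -wrap(np.diff(phase, axis=1))

    return phase, inst_freq_hz, group_delay


if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr) / sr
    tone = np.sin(2 * np.pi * 1010.0 * t)  # 1010 Hz test tone
    phase, inst_freq_hz, group_delay = phase_features(tone, sr=sr)
    # Bin 32 is centred at 1000 Hz; its estimate should land near 1010 Hz.
    print(np.median(inst_freq_hz[:, 32]))
```

Both derivative features depend only on differences between adjacent phase values (across frames or across bins), which is the kind of relationship the abstract reports as more informative than the raw phase itself.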