In speech-related classification tasks, frequency-domain acoustic features such as logarithmic Mel-filter bank coefficients (FBANK) and cepstral-domain acoustic features such as Mel-frequency cepstral coefficients (MFCC) are often used. However, time-domain features can be more effective in some sound classification tasks that involve non-vocal or weakly speech-related sounds. We previously proposed bit sequence representation (BSR), a time-domain binary acoustic feature based on the raw waveform. Compared with MFCC, BSR performed better in environmental sound detection and achieved comparable accuracy in limited-vocabulary speech recognition tasks. In this paper, we propose an improved BSR feature called BSR-float16, which represents floating-point values more precisely. We experimentally demonstrate the complementarity among time-domain, frequency-domain, and cepstral-domain features using the Speech Commands dataset released by Google. Motivated by this complementarity, we use a simple back-end score fusion method to improve the final classification accuracy. The fused system also shows better noise robustness.
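As a rough illustration of the two ideas named above, the following minimal sketch shows one plausible reading of a float16 bit-sequence feature and of back-end score fusion. The abstract does not specify the bit layout, the normalization, or the fusion rule, so everything here is an assumption: samples are taken to be normalized to [-1, 1], each sample is expanded into its 16 IEEE 754 half-precision bits, and fusion is a weighted sum of per-class scores. The function names `float16_bits`, `waveform_to_bits`, and `fuse_scores` are hypothetical and do not come from the paper.

```python
import struct
from typing import List, Sequence


def float16_bits(sample: float) -> List[int]:
    """Return the 16 bits of `sample` encoded as IEEE 754 half precision.

    Bit order is most significant first: 1 sign bit, 5 exponent bits,
    10 mantissa bits. Assumes |sample| fits the half-precision range
    (struct raises OverflowError otherwise).
    """
    # Pack as a big-endian half float, then reinterpret as a 16-bit integer.
    (word,) = struct.unpack(">H", struct.pack(">e", sample))
    return [(word >> shift) & 1 for shift in range(15, -1, -1)]


def waveform_to_bits(waveform: Sequence[float]) -> List[List[int]]:
    """Map each sample to its 16-bit vector, giving a (num_samples x 16) binary matrix."""
    return [float16_bits(s) for s in waveform]


def fuse_scores(score_sets: Sequence[Sequence[float]],
                weights: Sequence[float]) -> List[float]:
    """Weighted sum of per-class scores from several systems (assumed fusion rule)."""
    n_classes = len(score_sets[0])
    return [
        sum(w * scores[c] for w, scores in zip(weights, score_sets))
        for c in range(n_classes)
    ]


if __name__ == "__main__":
    # Three normalized samples become a 3 x 16 binary feature matrix.
    for row in waveform_to_bits([1.0, -0.5, 0.0]):
        print(row)

    # Hypothetical class scores from a time-domain and a cepstral-domain system.
    fused = fuse_scores([[0.6, 0.4], [0.2, 0.8]], weights=[0.5, 0.5])
    print("predicted class:", max(range(len(fused)), key=fused.__getitem__))
```

In this sketch the fusion weights are fixed by hand; in practice they would typically be tuned on held-out data, and the per-system scores would need to be on a comparable scale (for example, posterior probabilities) before being summed.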