In this paper we explore the possibility of maximizing the information represented in spectrograms by making the spectrogram basis functions trainable. We experiment with two different tasks, namely keyword spotting (KWS) and automatic speech recognition (ASR). For most neural network models, the architecture and hyperparameters are typically fine-tuned and optimized in experiments; input features, however, are often treated as fixed. In the case of audio, signals can be expressed in two main ways: raw waveforms (time domain) or spectrograms (time-frequency domain). In addition, different spectrogram types are often tailored to fit different applications. In our experiments, we allow for this tailoring directly as part of the network. Our experimental results show that using trainable basis functions can boost KWS accuracy by 14.2 percentage points and lower the phone error rate (PER) by 9.5 percentage points. Although models using trainable basis functions become less effective as model complexity increases, the trained filter shapes can still provide insights into which frequency bins are important for a specific task. From our experiments, we conclude that trainable basis functions are a useful tool for boosting performance when model complexity is limited.
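To make the underlying idea concrete, the sketch below (a minimal NumPy illustration, not the paper's implementation) shows how one spectrogram frame reduces to a matrix product between the signal and a pair of Fourier basis matrices. Because the transform is just a matrix multiplication, a deep-learning framework can register those basis matrices as trainable parameters and update them by backpropagation alongside the rest of the network; here they are simply initialized to the standard DFT basis.

```python
import numpy as np

def fourier_basis(n_fft):
    # Real and imaginary DFT basis matrices, shape (n_fft//2 + 1, n_fft).
    # In a trainable setting these would be learnable parameters
    # initialized to this standard Fourier basis.
    k = np.arange(n_fft // 2 + 1)[:, None]   # frequency bin index
    n = np.arange(n_fft)[None, :]            # sample index
    ang = 2 * np.pi * k * n / n_fft
    return np.cos(ang), -np.sin(ang)

def spectrogram_frame(x, basis_re, basis_im):
    # Magnitude spectrum of one frame as a matrix product --
    # exactly the form that can be made trainable end-to-end.
    return np.sqrt((basis_re @ x) ** 2 + (basis_im @ x) ** 2)

n_fft = 64
basis_re, basis_im = fourier_basis(n_fft)
t = np.arange(n_fft)
x = np.sin(2 * np.pi * 8 * t / n_fft)   # pure tone at bin 8
mag = spectrogram_frame(x, basis_re, basis_im)
print(int(np.argmax(mag)))               # peak lands at bin 8
```

With the fixed basis, the magnitude spectrum of the test tone peaks at the expected frequency bin; replacing `basis_re` and `basis_im` with trainable parameters is what allows the filter shapes to adapt to a specific task.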