Digital signal processing (DSP) based representations such as Mel-Frequency Cepstral Coefficients (MFCCs) are well known to be a solid basis for various audio processing tasks. Alternatively, analog feature representations, which rely on bandpass filtering that is feasible in analog electronics, allow much lower system power consumption than their digital counterparts while achieving performance parity on traditional tasks like voice activity detection. This work explores the use of analog features on multiple speech processing tasks that vary in time dependencies: wake word detection, keyword spotting, and speaker identification. The evaluation shows that analog features remain more power-efficient than digital features and are competitive on simpler tasks, but suffer an increasing performance drop on more complex tasks where long-time correlations are present. We also introduce a novel theoretical framework based on information theory to explain this performance drop: it quantifies the information flow in the feature calculation, which helps identify the performance bottlenecks. The theoretical claims are experimentally validated, leading to an increase of up to 6% in keyword spotting accuracy, even surpassing the digital baseline features. The proposed analog-feature-based systems could pave the way to achieving best-in-class accuracy and power consumption simultaneously.