Real-time processing of time-series signals is a critical issue for many real-life applications. Real-time processing is especially important in the audio domain, as human perception of sound is sensitive to any disturbance in the perceived signal, especially lag between the auditory and visual modalities. The rise of deep learning (DL) models has complicated the signal-processing landscape. Although they often achieve superior quality compared to standard DSP methods, this advantage is diminished by higher latency. In this work we propose a novel method for minimizing inference latency and memory consumption, called Short-Term Memory Convolution (STMC), together with its transposed counterpart. The main advantage of STMC is low latency, comparable to that of long short-term memory (LSTM) networks. Furthermore, the training of STMC-based models is faster and more stable, as the method is based solely on convolutional neural networks (CNNs). In this study we demonstrate an application of this solution to a U-Net model for a speech separation task and to a GhostNet model for an acoustic scene classification (ASC) task. For speech separation we achieved a 5-fold reduction in inference time and a 2-fold reduction in latency without affecting output quality. Inference for the ASC task was up to 4 times faster while preserving the original accuracy.
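The core idea behind low-latency convolutional inference is that a streaming layer can cache the last few input samples between chunks, so chunk-wise outputs exactly match a full-sequence pass. The following is a minimal NumPy sketch of that caching principle, not the paper's actual STMC implementation; the class and function names are illustrative.

```python
import numpy as np

def conv1d_causal(x, kernel):
    """Reference: causal 1-D convolution over the full sequence,
    with len(kernel)-1 zeros of left padding."""
    k = len(kernel)
    padded = np.concatenate([np.zeros(k - 1), x])
    return np.array([np.dot(padded[i:i + k], kernel)
                     for i in range(len(x))])

class StreamingConv1D:
    """Causal 1-D convolution that processes the input chunk by chunk,
    caching the last (kernel_size - 1) samples between calls so the
    concatenated chunk outputs equal the full-sequence result."""

    def __init__(self, kernel):
        self.kernel = np.asarray(kernel, dtype=float)
        # State buffer: the left context carried over between chunks.
        self.buffer = np.zeros(len(kernel) - 1)

    def __call__(self, chunk):
        k = len(self.kernel)
        # Prepend the cached context to the new chunk.
        extended = np.concatenate([self.buffer, chunk])
        out = np.array([np.dot(extended[i:i + k], self.kernel)
                        for i in range(len(chunk))])
        # Keep the last k-1 samples as context for the next call.
        self.buffer = extended[-(k - 1):]
        return out
```

With this state buffer, per-chunk latency is bounded by the chunk length rather than the full receptive field, which is the general mechanism that cached-convolution schemes such as STMC exploit.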