With the growth of computing power on mobile phones and privacy concerns over user data, on-device real-time speech processing has become an important research topic. In this paper, we focus on methods for real-time spectrogram inversion, where an algorithm receives a portion of the input signal (e.g., one frame) and processes it incrementally, i.e., operates in streaming mode. We present a real-time Griffin-Lim (GL) algorithm using a sliding window approach in the STFT domain. The proposed algorithm is 2.4x faster than real time on the ARM CPU of a Pixel 4. In addition, we explore a neural vocoder operating in streaming mode and demonstrate the impact of lookahead on perceptual quality. As little as one hop size (12.5 ms) of lookahead significantly improves perceptual quality in comparison to a causal model. We compare GL with the neural vocoder and show different trade-offs in terms of perceptual quality, on-device latency, algorithmic delay, memory footprint and noise sensitivity. For a fair quality assessment of the GL approach, we use the log-magnitude spectrogram as input, without a mel transformation. We evaluate the presented real-time spectrogram inversion approaches on clean, noisy and atypical speech.
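The idea of running Griffin-Lim over a sliding window of STFT frames can be illustrated with a minimal sketch. This is not the paper's implementation. The function names, the Hann window, the zero initial phase, and the values of `n_fft`, `hop`, `context` and `n_iter` are all assumptions chosen for illustration. Each incoming magnitude frame is appended to a short window of recent frames, a few GL iterations refine the phases inside that window, and frames that leave the window keep their last phase estimate.

```python
import numpy as np


def stft_frames(signal, n_fft, hop, window):
    """Windowed one-sided STFT with no padding; returns (frames, n_fft // 2 + 1)."""
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=-1)


def istft_frames(spec, n_fft, hop, window):
    """Inverse STFT by weighted overlap-add with squared-window normalization."""
    frames = np.fft.irfft(spec, n=n_fft, axis=-1) * window
    length = n_fft + hop * (len(spec) - 1)
    out = np.zeros(length)
    norm = np.zeros(length)
    for i, frame in enumerate(frames):
        out[i * hop : i * hop + n_fft] += frame
        norm[i * hop : i * hop + n_fft] += window ** 2
    return out / np.maximum(norm, 1e-8)


def sliding_window_griffin_lim(magnitudes, n_fft=512, hop=128, context=4, n_iter=4):
    """Estimate phases frame by frame, refining only the last `context` frames.

    Illustrative assumption: new frames start with zero phase, and frames
    older than `context` are frozen and never revisited.
    """
    window = np.hanning(n_fft)
    phases = np.zeros_like(magnitudes)
    for t in range(len(magnitudes)):  # frames arrive one at a time
        start = max(0, t - context + 1)
        mag = magnitudes[start : t + 1]
        phase = phases[start : t + 1].copy()
        for _ in range(n_iter):
            # Project onto consistent spectrograms, then keep the given magnitudes.
            audio = istft_frames(mag * np.exp(1j * phase), n_fft, hop, window)
            phase = np.angle(stft_frames(audio, n_fft, hop, window))
        phases[start : t + 1] = phase
    return magnitudes * np.exp(1j * phases)
```

In this sketch the cost per incoming frame is bounded by `context * n_iter` small FFT pairs, which is what makes a fixed per-frame budget possible. A real streaming system would also emit audio one hop at a time instead of returning the full complex spectrogram.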