We present two methods for real-time magnitude spectrogram inversion: streaming Griffin-Lim (GL) and streaming MelGAN. We demonstrate the impact of lookahead on the perceptual quality of MelGAN: as little as one hop (12.5 ms) of lookahead significantly improves perceptual quality compared with the causal version. We compare streaming GL with streaming MelGAN and show the trade-offs between them in terms of perceptual quality, on-device latency, algorithmic delay, memory footprint, and noise sensitivity. For a fair quality assessment of the GL approach, we use the log-magnitude spectrogram as input, without a mel transformation. We evaluate both real-time spectrogram inversion approaches on clean, noisy, and atypical speech. We identify the conditions under which streaming GL achieves quality comparable to MelGAN: noisy audio and no mel transformation. Streaming GL runs 2.4x faster than real time on the ARM CPU of a Pixel4 and has a minimal memory footprint, which makes it attractive for wearable devices.
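To make the timing figures in the abstract concrete, the following is a minimal arithmetic sketch and not part of the presented systems. It relates the 12.5 ms hop and the 2.4x real-time factor to the delay contributed by lookahead alone and to the average compute budget per hop. The function names are hypothetical, and the 16 kHz sample rate is an assumption used only for illustration, since the abstract does not state one. The total algorithmic delay of a streaming model also depends on framing details that this sketch does not model.

```python
# Illustrative arithmetic only; names and the sample rate are assumptions.

HOP_MS = 12.5            # hop size stated in the abstract
REAL_TIME_FACTOR = 2.4   # streaming GL speed on the Pixel4 ARM CPU, from the abstract
SAMPLE_RATE_HZ = 16_000  # assumed for illustration, not stated in the abstract


def lookahead_delay_ms(lookahead_hops: int, hop_ms: float = HOP_MS) -> float:
    """Delay added by waiting for `lookahead_hops` future frames."""
    return lookahead_hops * hop_ms


def compute_time_per_hop_ms(real_time_factor: float, hop_ms: float = HOP_MS) -> float:
    """Average compute time per hop when running `real_time_factor` times faster than real time."""
    return hop_ms / real_time_factor


def hop_samples(sample_rate_hz: int, hop_ms: float = HOP_MS) -> int:
    """Number of audio samples covered by one hop at the given sample rate."""
    return round(sample_rate_hz * hop_ms / 1000.0)


if __name__ == "__main__":
    print("lookahead delay for 1 hop (ms):", lookahead_delay_ms(1))
    print("samples per hop at assumed rate:", hop_samples(SAMPLE_RATE_HZ))
    print("avg compute per hop (ms):", round(compute_time_per_hop_ms(REAL_TIME_FACTOR), 2))
```

Under these assumptions, one hop of lookahead adds 12.5 ms of delay, while a 2.4x real-time factor leaves well under half of each hop interval spent on computation.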