A deep neural network solution for time-scale modification (TSM) focused on large stretching factors is proposed, targeting environmental sounds. Traditional TSM artifacts such as transient smearing, loss of presence, and phasiness are heavily accentuated and cause poor audio quality when the TSM factor is four or larger. The weakness of established TSM methods, often based on a phase vocoder structure, lies in the poor description and scaling of the transient and noise components, or nuances, of a sound. Our novel solution combines a sines-transients-noise decomposition with an independent WaveNet synthesizer to provide a better description of the noise component and improved sound quality for large stretching factors. Results of a subjective listening test against four other TSM algorithms are reported, showing that the proposed method is often superior. The proposed method is stereo-compatible and has a wide range of applications related to slow-motion playback of media content.
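To make the notion of a stretching factor concrete, the following is a minimal sketch of a naive overlap-add (OLA) time stretch. It is not the proposed method and not one of the algorithms in the listening test. It only illustrates how a factor of four spaces the output frames four times further apart than the input frames. The function name `ola_stretch` and the parameters `frame_len` and `synth_hop` are assumptions chosen for this illustration.

```python
import numpy as np

def ola_stretch(x, alpha, frame_len=1024, synth_hop=256):
    """Naive overlap-add time stretch (illustrative only, hypothetical helper).

    Frames are read from the input every synth_hop / alpha samples and
    written to the output every synth_hop samples, so the output is
    roughly alpha times longer than the input.
    """
    x = np.asarray(x, dtype=float)
    if len(x) < frame_len:
        raise ValueError("input must be at least frame_len samples long")
    window = np.hanning(frame_len)
    analysis_hop = synth_hop / alpha  # smaller than synth_hop when alpha > 1
    n_frames = int(np.floor((len(x) - frame_len) / analysis_hop)) + 1
    out_len = (n_frames - 1) * synth_hop + frame_len
    y = np.zeros(out_len)
    norm = np.zeros(out_len)  # running sum of windows, used for normalisation
    for m in range(n_frames):
        a = int(round(m * analysis_hop))  # read position in the input
        s = m * synth_hop                 # write position in the output
        y[s:s + frame_len] += window * x[a:a + frame_len]
        norm[s:s + frame_len] += window
    # Guard against division by zero where the window sum vanishes at the edges.
    return y / np.maximum(norm, 1e-8)

# Usage: stretch one second of a 440 Hz tone (8 kHz sample rate) by a factor of four.
sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440.0 * t)
stretched = ola_stretch(tone, alpha=4.0)
```

With `alpha=4.0`, consecutive frames overlap heavily in the input but are placed far apart in the output, so a single transient in the input is copied into several output frames. This simple scheme makes no attempt to keep phases coherent between frames or to treat sines, transients, and noise separately.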