An automatic pitch correction system typically includes several stages, such as pitch extraction, deviation estimation, pitch shift processing, and cross-fade smoothing. However, designing these components with rule-based strategies often requires domain expertise, and the resulting systems are likely to fail on corner cases. In this paper, we present KaraTuner, an end-to-end neural architecture that predicts the pitch curve and resynthesizes the singing voice directly from the tuned pitch and the vocal spectrum extracted from the original recordings. KaraTuner introduces several key techniques to ensure pitch accuracy, pitch naturalness, timbre consistency, and sound quality. A feed-forward Transformer is employed in the pitch predictor to capture long-term dependencies in the vocal spectrum and the musical notes. We also develop a pitch-controllable vocoder based on a novel source-filter block and the Fre-GAN architecture. In A/B tests, KaraTuner is preferred over the rule-based pitch correction approach, and perceptual experiments show that the proposed vocoder achieves significant advantages in timbre consistency and sound quality compared with the parametric WORLD vocoder, the phase vocoder, and the CLPC vocoder.
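To make the deviation estimation and pitch shift stages of a conventional pipeline concrete, the following is a minimal sketch of a per-frame, rule-based correction. It is not KaraTuner's method; it only illustrates the kind of hand-designed baseline the abstract contrasts against. The function names, the A4 = 440 Hz reference tuning, and the `strength` parameter are assumptions introduced for illustration.

```python
import math

A4_HZ = 440.0   # assumed reference tuning (hypothetical choice)
A4_MIDI = 69    # MIDI note number of A4


def hz_to_midi(f0_hz):
    """Convert a frequency in Hz to a fractional MIDI note number."""
    return A4_MIDI + 12.0 * math.log2(f0_hz / A4_HZ)


def deviation_cents(f0_hz, target_midi):
    """Deviation of the sung pitch from the target note, in cents."""
    return 100.0 * (hz_to_midi(f0_hz) - target_midi)


def correct_frame(f0_hz, target_midi, strength=1.0):
    """Shift one frame's pitch toward the target note.

    strength=1.0 snaps fully to the note; smaller values keep part of
    the original deviation. Unvoiced frames (f0 <= 0) pass through.
    """
    if f0_hz <= 0:
        return f0_hz
    dev = deviation_cents(f0_hz, target_midi)
    # Shifting by -strength * dev cents is a multiplicative frequency ratio.
    return f0_hz * 2.0 ** (-strength * dev / 1200.0)


# Example: a slightly sharp A4 corrected fully and half-way.
full = correct_frame(450.0, A4_MIDI, strength=1.0)
half = correct_frame(450.0, A4_MIDI, strength=0.5)
```

A fixed rule like this flattens vibrato and note transitions when applied at full strength, which is one reason such stages need careful tuning and can fail on corner cases.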