Controllable generative sequence models that can extract and replicate the style of specific examples enable many applications, including narrating audiobooks in different voices, auto-completing and auto-correcting handwritten text, and generating missing training samples for downstream recognition tasks. However, typical training algorithms for these controllable sequence models suffer from a training-inference mismatch: the same sample serves as both the content and the style input during training, whereas different samples are given at inference. In this paper, we tackle this training-inference mismatch encountered during unsupervised learning of controllable generative sequence models. By introducing a style transformation module that we call style equalization, we enable training with different content and style samples, thereby mitigating the mismatch. To demonstrate its generality, we apply style equalization to text-to-speech and text-to-handwriting synthesis on three datasets. Our models achieve state-of-the-art style replication, with a mean style opinion score similar to that of the real data. Moreover, the proposed method enables style interpolation between sequences and the generation of novel styles.
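To make the training-inference mismatch concrete, the following is a minimal, hypothetical sketch in PyTorch. The names `Synthesizer` and `StyleEqualizer` are illustrative stand-ins, not the paper's actual architecture; the sketch only shows how conventional training pairs a sample with itself, while a style-equalization step lets training condition on a *different* style sample, matching the inference setup.

```python
# Hypothetical sketch: conventional conditioning vs. style-equalized conditioning.
# All module names and shapes are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn


class Synthesizer(nn.Module):
    """Toy controllable generator conditioned on content and a style sample."""

    def __init__(self, dim: int = 16):
        super().__init__()
        self.style_encoder = nn.Linear(dim, dim)
        self.decoder = nn.Linear(2 * dim, dim)

    def forward(self, content: torch.Tensor, style_sample: torch.Tensor) -> torch.Tensor:
        style = self.style_encoder(style_sample)
        return self.decoder(torch.cat([content, style], dim=-1))


class StyleEqualizer(nn.Module):
    """Hypothetical transformation mapping the style of sample B toward sample A,
    so (content of A, equalized style of B) forms a valid training pair."""

    def __init__(self, dim: int = 16):
        super().__init__()
        self.transform = nn.Linear(2 * dim, dim)

    def forward(self, style_a: torch.Tensor, style_b: torch.Tensor) -> torch.Tensor:
        return self.transform(torch.cat([style_a, style_b], dim=-1))


dim = 16
model, equalizer = Synthesizer(dim), StyleEqualizer(dim)
sample_a, sample_b = torch.randn(1, dim), torch.randn(1, dim)

# Conventional training: the SAME sample provides both content and style ...
recon = model(content=sample_a, style_sample=sample_a)

# ... but at inference, content and style come from DIFFERENT samples:
generated = model(content=sample_a, style_sample=sample_b)

# With style equalization, training already uses different samples: the style
# of B is transformed toward A before conditioning, matching the inference setup.
equalized = equalizer(sample_a, sample_b)
recon_eq = model(content=sample_a, style_sample=equalized)
loss = nn.functional.mse_loss(recon_eq, sample_a)  # reconstruction target is A
```

Under these assumptions, the reconstruction loss remains well defined even though the style input originates from a different sample, which is the property the abstract attributes to style equalization.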