Controllable generative sequence models that can extract and replicate the style of specific examples enable many applications, including narrating audiobooks in different voices, auto-completing and auto-correcting handwriting, and generating missing training samples for downstream recognition tasks. However, in the unsupervised style setting, typical training algorithms for controllable sequence generative models suffer from a training-inference mismatch: the same sample serves as both the content and the style input during training, but unpaired samples are given at inference. In this paper, we tackle this training-inference mismatch in the unsupervised learning of controllable generative sequence models. We propose style equalization, a simple yet effective method that uses a style transformation module to transfer the target style information onto an unrelated style input. This enables training with unpaired content and style samples and thereby mitigates the training-inference mismatch. We apply style equalization to text-to-speech and text-to-handwriting synthesis on three datasets, and conduct a thorough evaluation, including both quantitative metrics and qualitative user studies. Our results show that, by mitigating the training-inference mismatch with the proposed style equalization, we achieve style replication scores comparable to real data in our user studies.
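To make the training setup concrete, the following is a minimal, hypothetical sketch of the idea described above: a style encoder summarizes a sequence into a style embedding, and a style transformation module moves the embedding of an unrelated sample toward the target style, so the decoder can be conditioned on unpaired content and style samples. All names (`style_encoder`, `style_equalizer`) and the gating form are illustrative assumptions, not the paper's actual architecture; the modules here are untrained stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # style-embedding dimension (illustrative choice)

def style_encoder(x):
    """Toy style encoder: mean-pool a (T, D) sequence into a style vector."""
    return x.mean(axis=0)

def style_equalizer(s_unrelated, s_target):
    """Hypothetical style transformation module: blend the unrelated
    sample's style embedding toward the target style with a sigmoid gate
    (a random, untrained stand-in for a learned module)."""
    gate = 1.0 / (1.0 + np.exp(-rng.standard_normal(D)))
    return gate * s_target + (1.0 - gate) * s_unrelated

# Two unpaired sequences: A supplies the content and the target style,
# B is an unrelated sample whose style input gets equalized.
seq_a = rng.standard_normal((50, D))
seq_b = rng.standard_normal((70, D))

s_a = style_encoder(seq_a)        # target style (from the content sample)
s_b = style_encoder(seq_b)        # unrelated style input
s_eq = style_equalizer(s_b, s_a)  # equalized style fed to the decoder

print(s_eq.shape)  # (16,)
```

Because the gate lies in (0, 1) elementwise, the equalized embedding is a convex combination of the two style vectors, so training sees style inputs derived from unrelated samples, matching the unpaired condition encountered at inference.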