The task of converting text input into video content is becoming an important topic in synthetic media generation. Several methods have been proposed, some of them achieving close-to-natural performance on constrained tasks. In this paper, we tackle a sub-problem of the text-to-video generation problem: converting text into lip landmarks. However, we do this using a modular, controllable system architecture, and we evaluate each of its individual components. Our system, entitled FlexLip, is split into two separate modules, text-to-speech and speech-to-lip, both built on controllable deep neural network architectures. This modularity allows each component to be easily replaced, while also ensuring fast adaptation to new speaker identities by disentangling or projecting the input features. We show that by using as little as 20 minutes of data for the audio generation component, and as little as 5 minutes for the speech-to-lip component, the objective measures of the generated lip landmarks are comparable with those obtained when using a larger set of training samples. We also introduce a series of objective evaluation measures over the complete flow of our system, taking into account several aspects of the data and system configuration. These aspects pertain to the quality and amount of training data, the use of pretrained models and the data they were trained on, as well as the identity of the target speaker; with regard to the latter, we show that we can perform zero-shot lip adaptation to an unseen identity by simply updating the lip shape used in our model.
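To make the modular design concrete, the following is a minimal sketch of the two-stage pipeline described above, in which the two modules communicate only through the generated audio signal, so either one can be swapped or retrained independently. All class and function names here (TextToSpeech, SpeechToLip, text_to_lip) are hypothetical illustrations, not the authors' actual API.

```python
# Hypothetical sketch of a modular text-to-lip pipeline; names are
# illustrative and do not correspond to the FlexLip codebase.
from dataclasses import dataclass
import numpy as np


@dataclass
class LipLandmarks:
    # (num_frames, num_points, 2) array of 2D lip contour points per video frame
    points: np.ndarray


class TextToSpeech:
    """First module: converts text into a speech waveform for a given speaker."""
    def synthesize(self, text: str, speaker_id: str) -> np.ndarray:
        raise NotImplementedError


class SpeechToLip:
    """Second module: maps a speech signal to a sequence of lip landmarks."""
    def predict(self, audio: np.ndarray) -> LipLandmarks:
        raise NotImplementedError


def text_to_lip(tts: TextToSpeech, s2l: SpeechToLip,
                text: str, speaker_id: str) -> LipLandmarks:
    # The audio array is the only interface between the two modules,
    # which is what makes each component independently replaceable.
    audio = tts.synthesize(text, speaker_id)
    return s2l.predict(audio)
```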