This research aims to make metaverse characters more realistic by adding lip animations learnt from videos in the wild. To achieve this, we extend the Tacotron 2 text-to-speech synthesizer to generate lip movements together with the mel spectrogram in a single pass. The encoder and gate layer weights are pre-trained on the LJ Speech 1.1 data set, while the decoder is retrained on 93 clips of TED talk videos extracted from the LRS 3 data set. Our novel decoder predicts displacements of 20 lip landmark positions across time, using labels automatically extracted by the OpenFace 2.0 landmark predictor. Training converged in 7 hours using less than 5 minutes of video. We conducted an ablation study on the Pre/Post-Net and the pre-trained encoder weights to demonstrate the effectiveness of transfer learning between audio and visual speech data.
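The abstract describes extending the Tacotron 2 decoder so that each decoder step emits lip landmark displacements alongside the mel frame and gate (stop-token) logit. The PyTorch sketch below illustrates one way such an extra output head and joint loss could look; the module and parameter names (`LipAwareDecoderStep`, `decoder_dim`, `joint_loss`) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch (not the paper's code): a Tacotron 2-style decoder step
# extended with an extra linear head predicting displacements of 20 lip
# landmarks (x, y per landmark -> 40 values) alongside the mel-spectrogram frame.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LipAwareDecoderStep(nn.Module):
    def __init__(self, decoder_dim=1024, n_mels=80, n_landmarks=20):
        super().__init__()
        # Standard Tacotron 2 projections: mel frame and gate (stop-token) logit
        self.mel_proj = nn.Linear(decoder_dim, n_mels)
        self.gate_proj = nn.Linear(decoder_dim, 1)
        # Added head: per-frame displacement of each lip landmark in (x, y)
        self.lip_proj = nn.Linear(decoder_dim, n_landmarks * 2)

    def forward(self, decoder_hidden):
        mel_frame = self.mel_proj(decoder_hidden)    # (B, n_mels)
        gate_logit = self.gate_proj(decoder_hidden)  # (B, 1)
        lip_disp = self.lip_proj(decoder_hidden)     # (B, n_landmarks * 2)
        return mel_frame, gate_logit, lip_disp

def joint_loss(mel_pred, mel_tgt, gate_pred, gate_tgt, lip_pred, lip_tgt):
    """Combine mel reconstruction, gate, and landmark-displacement terms."""
    return (F.mse_loss(mel_pred, mel_tgt)
            + F.binary_cross_entropy_with_logits(gate_pred, gate_tgt)
            + F.mse_loss(lip_pred, lip_tgt))
```

Under this sketch, the landmark targets would come from OpenFace 2.0 tracks aligned to the mel frames, so the same teacher-forced decoder loop that learns the spectrogram also learns the lip trajectory.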