Understanding lip movements and inferring speech from them is notoriously difficult for an average person. Accurate lip-reading is aided by various cues from the speaker and their contextual or environmental setting. Every speaker has a distinct accent and speaking style, which can be inferred from their visual and speech features. This work aims to learn the correlation/mapping between speech and the sequence of lip movements of individual speakers in an unconstrained, large-vocabulary setting. We model the frame sequence as a prior to the transformer in an auto-encoder setting and learn a joint embedding that exploits the temporal properties of both audio and video. We learn temporal synchronization using deep metric learning, which guides the decoder to generate speech in sync with the input lip movements. The predictive posterior thus gives us the generated speech in the speaker's speaking style. We train our model on the GRID and Lip2Wav Chemistry lecture datasets to evaluate single-speaker natural speech generation from lip movements in an unconstrained natural setting. Extensive evaluation using various qualitative and quantitative metrics, along with human evaluation, shows that our method outperforms the state of the art on the Lip2Wav Chemistry dataset (a large vocabulary in an unconstrained setting) by a good margin across almost all evaluation metrics and marginally outperforms the state of the art on the GRID dataset.
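The deep-metric-learning synchronization objective mentioned above can be illustrated with a minimal sketch (not the authors' implementation): a contrastive loss that pulls the joint embeddings of temporally aligned audio/video windows together and pushes misaligned pairs apart by a margin. The encoder outputs, embedding dimension, and margin value below are illustrative assumptions.

```python
# Minimal sketch of a contrastive synchronization loss between audio and
# video window embeddings; dimensions and margin are assumed for illustration.
import torch
import torch.nn.functional as F

def sync_contrastive_loss(video_emb, audio_emb, is_synced, margin=0.5):
    """video_emb, audio_emb: (batch, dim) window embeddings.
    is_synced: (batch,) 1.0 for temporally aligned pairs, 0.0 for shifted pairs."""
    # L2-normalise both embeddings so distances are scale-invariant
    v = F.normalize(video_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)
    dist = (v - a).pow(2).sum(dim=-1).sqrt()
    # Pull aligned pairs together, push misaligned pairs beyond the margin
    pos = is_synced * dist.pow(2)
    neg = (1.0 - is_synced) * F.relu(margin - dist).pow(2)
    return (pos + neg).mean()

# Example: 4 aligned pairs and 4 shifted (negative) pairs
v = torch.randn(8, 256)
a = torch.randn(8, 256)
y = torch.tensor([1., 1., 1., 1., 0., 0., 0., 0.])
loss = sync_contrastive_loss(v, a, y)
```

In practice, negative pairs for such a loss are typically formed by shifting the audio window in time relative to the video window of the same utterance.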