The recent state of the art in monocular 3D face reconstruction from image data has made impressive advances, thanks to the advent of Deep Learning. However, it has mostly focused on input coming from a single RGB image, overlooking the following important factors: a) Nowadays, the vast majority of facial image data of interest do not originate from single images but rather from videos, which contain rich dynamic information. b) Furthermore, these videos typically capture individuals in some form of verbal communication (public talks, teleconferences, audiovisual human-computer interactions, interviews, monologues/dialogues in movies, etc.). When existing 3D face reconstruction methods are applied to such videos, the artifacts in the reconstructed shape and motion of the mouth area are often severe, since they do not match the speech audio well. To overcome these limitations, we present the first method for visual speech-aware perceptual reconstruction of 3D mouth expressions. We do this by proposing a "lipread" loss, which guides the fitting process so that the perception elicited by the 3D reconstructed talking head resembles that of the original video footage. We demonstrate that, interestingly, the lipread loss is better suited for 3D reconstruction of mouth movements than traditional landmark losses, and even than direct 3D supervision. Furthermore, the devised method does not rely on any text transcriptions or corresponding audio, rendering it ideal for training on unlabeled datasets. We verify the effectiveness of our method through exhaustive objective evaluations on three large-scale datasets, as well as subjective evaluation with two web-based user studies.
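The lipread loss described above can be understood as a perceptual distance between features of a (frozen) lip-reading network, computed on mouth crops from the rendered 3D talking head and the original footage. The sketch below is a minimal, hypothetical illustration of this idea: the real method would use the activations of a pretrained lipreader, whereas here `lipreader_features` is a toy stand-in (mean/std pooling over pixels) so the example stays self-contained.

```python
import numpy as np

def lipreader_features(mouth_crops):
    # Stand-in for a pretrained lip-reading network's feature extractor
    # (hypothetical; the actual loss would use frozen lipreader activations).
    # mouth_crops: array of shape (T, H, W), one mouth crop per video frame.
    flat = mouth_crops.reshape(mouth_crops.shape[0], -1)
    return np.concatenate([flat.mean(axis=1), flat.std(axis=1)])

def lipread_loss(rendered_crops, original_crops):
    # Perceptual distance (cosine distance) between the embeddings of the
    # rendered mouth region and the original video's mouth region. Driving
    # the fitting with this loss encourages the reconstruction to "read"
    # the same as the source footage, rather than just aligning landmarks.
    f_r = lipreader_features(rendered_crops)
    f_o = lipreader_features(original_crops)
    cos = np.dot(f_r, f_o) / (np.linalg.norm(f_r) * np.linalg.norm(f_o) + 1e-8)
    return 1.0 - cos
```

By construction the loss is near zero when the rendered and original mouth sequences produce matching embeddings, and grows as their lipreader-perceived content diverges.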