Audio-Driven Face Animation is an eagerly anticipated technique for applications such as VR/AR, games, and movie making. With the rapid development of 3D engines, there is an increasing demand for driving 3D faces with audio. However, currently available 3D face animation datasets are either limited in scale or unsatisfactory in quality, which hampers further development of audio-driven 3D face animation. To address this challenge, we propose MMFace4D, a large-scale multi-modal 4D (3D sequence) face dataset consisting of 431 identities, 35,904 sequences, and 3.9 million frames. MMFace4D has three appealing characteristics: 1) highly diversified subjects and corpus, 2) synchronized audio and 3D mesh sequences with high-resolution face details, and 3) low storage cost achieved by a new, efficient compression algorithm for 3D mesh sequences. These characteristics enable the training of high-fidelity, expressive, and generalizable face animation models. On top of MMFace4D, we construct a challenging benchmark for audio-driven 3D face animation, together with a strong baseline that performs non-autoregressive generation, offers fast inference, and outperforms the state-of-the-art autoregressive method. The whole benchmark will be released.