In this paper, we investigate the task of hallucinating an authentic high-resolution (HR) human face from multiple low-resolution (LR) video snapshots. We propose a pure transformer-based model, dubbed VidFace, to fully exploit the full-range spatio-temporal information and facial structure cues among multiple thumbnails. Specifically, VidFace handles all snapshots at once and harnesses spatial and temporal information integrally to explore face alignments across all frames, thereby avoiding the accumulation of alignment errors. Moreover, we design a recurrent position embedding module that equips our transformer with facial priors, which not only effectively regularises the alignment mechanism but also removes the need for a notoriously expensive pre-training stage. Finally, we curate a new large-scale video face hallucination dataset from the public Voxceleb2 benchmark, which challenges prior art with unaligned and tiny face snapshots. To the best of our knowledge, this is the first attempt to develop a unified transformer-based solver tailored for video-based face hallucination. Extensive experiments on public video face benchmarks show that the proposed method significantly outperforms the state of the art.
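To make the two core ideas concrete, below is a minimal PyTorch sketch of (i) flattening all LR frames into a single token sequence so that one transformer attends jointly across space and time, and (ii) a recurrent position embedding that refines a positional table with a GRU. This is an illustrative assumption, not the authors' implementation: the class names VidFaceSketch and RecurrentPositionEmbedding, the GRU-based refinement, and all hyperparameters are hypothetical.

```python
import torch
import torch.nn as nn


class RecurrentPositionEmbedding(nn.Module):
    """Hypothetical sketch: a GRU sweeps over a learned positional table so
    that each token's code depends on the codes preceding it in the
    flattened spatio-temporal sequence."""

    def __init__(self, dim, max_tokens=4096):
        super().__init__()
        self.table = nn.Parameter(torch.randn(1, max_tokens, dim) * 0.02)
        self.gru = nn.GRU(dim, dim, batch_first=True)

    def forward(self, num_tokens):
        out, _ = self.gru(self.table[:, :num_tokens])  # (1, num_tokens, dim)
        return out


class VidFaceSketch(nn.Module):
    """Minimal stand-in for a joint spatio-temporal transformer: all T frames
    are patch-embedded and concatenated into one sequence, so self-attention
    spans every patch of every snapshot simultaneously."""

    def __init__(self, dim=128, heads=4, layers=4, patch=4, in_ch=3):
        super().__init__()
        self.embed = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        self.pos = RecurrentPositionEmbedding(dim)
        enc = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                         batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=layers)

    def forward(self, frames):                        # frames: (B, T, C, H, W)
        B, T, C, H, W = frames.shape
        x = self.embed(frames.flatten(0, 1))          # (B*T, dim, h, w)
        x = x.flatten(2).transpose(1, 2)              # (B*T, N, dim) patch tokens
        N = x.shape[1]
        x = x.reshape(B, T * N, -1)                   # all frames in one sequence
        x = x + self.pos(T * N)                       # recurrent position codes
        return self.encoder(x)                        # joint attention over T*N tokens


# Usage: 7 LR snapshots of a 32x32 face, attended to all at once.
feats = VidFaceSketch()(torch.randn(2, 7, 3, 32, 32))
print(feats.shape)  # torch.Size([2, 7*64, 128])
```

Handling the frames as one flat sequence, rather than aligning them pairwise, is what lets alignment be learned globally instead of accumulating errors frame by frame; an upsampling head (omitted here) would map the encoded tokens back to an HR face.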