To be truly understandable and accepted by Deaf communities, an automatic Sign Language Production (SLP) system must generate a photo-realistic signer. Prior approaches based on graphical avatars have proven unpopular, whereas recent neural SLP works that produce skeleton pose sequences have been shown to be unintelligible to Deaf viewers. In this paper, we propose SignGAN, the first SLP model to produce photo-realistic continuous sign language videos directly from spoken language. We employ a transformer architecture with a Mixture Density Network (MDN) formulation to handle the translation from spoken language to skeletal pose. A pose-conditioned human synthesis model is then introduced to generate a photo-realistic sign language video from the skeletal pose sequence, enabling photo-realistic sign videos to be produced directly from written text. We further propose a novel keypoint-based loss function that significantly improves the quality of synthesized hand images; by operating in keypoint space, it avoids the problems caused by motion blur. In addition, we introduce a method for controllable video generation, enabling training on large, diverse sign language datasets and providing control over the signer's appearance at inference. Using a dataset of eight different sign language interpreters extracted from broadcast footage, we show that SignGAN significantly outperforms all baseline methods on quantitative metrics and in human perceptual studies.
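To make the text-to-pose stage concrete, below is a minimal PyTorch sketch of an MDN output head on a transformer decoder, as the abstract describes for translating spoken language into skeletal pose. This is an illustrative assumption, not the authors' implementation: the names (`MDNHead`, `mdn_nll`), the mixture count, and the pose dimensionality are all hypothetical.

```python
# Hedged sketch (not the paper's code): an MDN head that predicts a Gaussian
# mixture over the next skeletal pose instead of a single point estimate.
import torch
import torch.nn as nn

NUM_MIXTURES = 5   # assumed number of Gaussian components
POSE_DIM = 150     # assumed flattened skeletal keypoint dimension

class MDNHead(nn.Module):
    """Maps transformer decoder states to mixture parameters (pi, mu, sigma)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.pi = nn.Linear(d_model, NUM_MIXTURES)                    # mixture weights
        self.mu = nn.Linear(d_model, NUM_MIXTURES * POSE_DIM)         # component means
        self.log_sigma = nn.Linear(d_model, NUM_MIXTURES * POSE_DIM)  # log std devs

    def forward(self, h: torch.Tensor):
        # h: (batch, time, d_model) decoder output
        B, T, _ = h.shape
        log_pi = torch.log_softmax(self.pi(h), dim=-1)                # (B, T, K)
        mu = self.mu(h).view(B, T, NUM_MIXTURES, POSE_DIM)
        sigma = self.log_sigma(h).view(B, T, NUM_MIXTURES, POSE_DIM).exp()
        return log_pi, mu, sigma

def mdn_nll(log_pi, mu, sigma, target):
    """Negative log-likelihood of the target poses under the predicted mixture."""
    # target: (B, T, POSE_DIM) ground-truth poses
    dist = torch.distributions.Normal(mu, sigma)
    # Per-component log-likelihood, summed over pose dimensions: (B, T, K)
    log_prob = dist.log_prob(target.unsqueeze(2)).sum(-1)
    return -torch.logsumexp(log_pi + log_prob, dim=-1).mean()
```

The point of the mixture formulation is that it represents the inherent ambiguity of signing: sampling from the predicted distribution, rather than regressing a single mean pose, avoids the over-smoothed "average" motion that plain regression tends to produce.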
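The keypoint-based hand loss can likewise be sketched. The following is an assumption about its general shape based only on the abstract's description of operating in keypoint space: a frozen, pre-trained hand keypoint detector (the `HandKeypointNet`-style module passed in here is hypothetical) extracts keypoints from generated and ground-truth hand crops, and the loss compares those keypoints rather than pixels.

```python
# Hedged sketch (not the paper's implementation): comparing detector keypoints
# instead of raw pixels makes the loss largely insensitive to motion blur.
import torch
import torch.nn as nn
import torch.nn.functional as F

class KeypointHandLoss(nn.Module):
    def __init__(self, keypoint_net: nn.Module):
        super().__init__()
        # Hypothetical pre-trained detector returning (B, 21, 2) hand keypoints.
        self.keypoint_net = keypoint_net.eval()
        for p in self.keypoint_net.parameters():
            p.requires_grad_(False)   # keep the detector frozen

    def forward(self, fake_hand: torch.Tensor, real_hand: torch.Tensor):
        # fake_hand, real_hand: (B, 3, H, W) cropped hand regions
        kp_fake = self.keypoint_net(fake_hand)        # gradients flow to generator
        with torch.no_grad():
            kp_real = self.keypoint_net(real_hand)    # fixed target keypoints
        return F.l1_loss(kp_fake, kp_real)
```

Under this reading, gradients push the generator toward correct hand structure (finger positions) rather than toward the blurry averages that a pixel-space loss rewards on motion-blurred training frames.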