We aim to solve the highly challenging task of generating continuous sign language videos solely from speech segments for the first time. Recent efforts in this space have focused on generating such videos from human-annotated text transcripts without considering other modalities. However, translating speech into sign language is a practical solution for communicating with people with hearing loss. Therefore, we eliminate the need for text input and design techniques that work for more natural, continuous, freely uttered speech covering an extensive vocabulary. Since current datasets are inadequate for generating sign language directly from speech, we collect and release the first Indian sign language dataset comprising speech-level annotations, text transcripts, and the corresponding sign-language videos. Next, we propose a multi-tasking transformer network trained to generate a signer's poses from speech segments. With speech-to-text as an auxiliary task and an additional cross-modal discriminator, our model learns to generate continuous sign pose sequences in an end-to-end manner. Extensive experiments and comparisons with other baselines demonstrate the effectiveness of our approach. We also conduct ablation studies to analyze the effect of different modules of our network. A demo video containing several results is included in the supplementary material.
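The multi-task objective sketched above can be illustrated with a minimal example: the pose-generation loss is combined with the speech-to-text auxiliary loss and the cross-modal discriminator's adversarial signal. This is a hedged sketch under assumed conventions, not the authors' exact formulation; the loss choices (MSE for poses) and the weighting terms `lambda_text` and `lambda_adv` are illustrative assumptions.

```python
import numpy as np

def pose_regression_loss(pred_poses, gt_poses):
    """Mean squared error between predicted and ground-truth pose
    keypoints (an assumed choice for the generation term)."""
    return float(np.mean((pred_poses - gt_poses) ** 2))

def total_loss(pose_loss, text_ce_loss, adv_loss,
               lambda_text=1.0, lambda_adv=0.1):
    """Weighted sum of the three terms described in the abstract:
    pose generation, speech-to-text auxiliary task, and the
    cross-modal adversarial loss. Weights are hypothetical."""
    return pose_loss + lambda_text * text_ce_loss + lambda_adv * adv_loss
```

In such a setup, the auxiliary speech-to-text branch pushes the shared speech encoder toward linguistically meaningful features, while the discriminator penalizes pose sequences that do not match the input speech, which is one common way to realize end-to-end cross-modal training.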