Sign language recognition (SLR) is a weakly supervised task that annotates sign videos with textual glosses. Recent studies show that insufficient training caused by the lack of large-scale available sign datasets has become the main bottleneck for SLR. Most SLR works therefore adopt pretrained visual modules and follow two mainstream solutions. Multi-stream architectures extend multi-cue visual features, yielding the current SOTA performance, but they require complex designs and may introduce potential noise. Alternatively, advanced single-cue SLR frameworks that use explicit cross-modal alignment between the visual and textual modalities are simple and effective, and potentially competitive with multi-cue frameworks. In this work, we propose CVT-SLR, a novel contrastive visual-textual transformation for SLR, to fully exploit the pretrained knowledge of both the visual and language modalities. Building on the single-cue cross-modal alignment framework, we introduce a variational autoencoder (VAE) for pretrained contextual knowledge together with a complete pretrained language module. The VAE implicitly aligns the visual and textual modalities while benefiting from pretrained contextual knowledge, as a traditional contextual module does. Meanwhile, a contrastive cross-modal alignment algorithm is designed to explicitly enhance the consistency constraints. Extensive experiments on public datasets (PHOENIX-2014 and PHOENIX-2014T) demonstrate that the proposed CVT-SLR consistently outperforms existing single-cue methods and even outperforms SOTA multi-cue methods.
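To make the idea of explicit contrastive cross-modal alignment concrete, the sketch below shows one common instantiation: a symmetric InfoNCE loss over batch-pooled visual and gloss embeddings, where matched video-gloss pairs are positives and all other pairs in the batch are negatives. This is a minimal illustration under assumed choices (mean-pooled sequence embeddings, a fixed temperature, and the hypothetical function name contrastive_alignment_loss), not the paper's exact algorithm.

```python
# Minimal sketch of a symmetric contrastive alignment loss between
# visual and textual sequence embeddings. Pooling, projection, and
# temperature here are illustrative assumptions.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(visual_emb: torch.Tensor,
                               text_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """visual_emb, text_emb: (batch, dim) pooled sequence embeddings."""
    v = F.normalize(visual_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature  # (batch, batch) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric InfoNCE: the diagonal holds matched video-gloss pairs
    # (positives); off-diagonal entries act as in-batch negatives.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.T, targets)
    return 0.5 * (loss_v2t + loss_t2v)

if __name__ == "__main__":
    # Random features stand in for visual-encoder and gloss-encoder outputs.
    visual = torch.randn(8, 512)
    textual = torch.randn(8, 512)
    print(contrastive_alignment_loss(visual, textual).item())
```

Losses of this form pull each video embedding toward its paired gloss embedding while pushing it away from the other glosses in the batch, which is one way to impose the consistency constraints the abstract describes.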