Sign language recognition (SLR) is a weakly supervised task that annotates sign videos with textual glosses. Recent studies show that insufficient training, caused by the lack of large-scale available sign datasets, has become the main bottleneck for SLR. Most SLR works therefore adopt pretrained visual modules and follow one of two mainstream solutions. Multi-stream architectures extend multi-cue visual features, yielding the current SOTA performance, but they require complex designs and may introduce potential noise. Alternatively, advanced single-cue SLR frameworks that use explicit cross-modal alignment between the visual and textual modalities are simple and effective, and potentially competitive with multi-cue frameworks. In this work, we propose a novel contrastive visual-textual transformation for SLR, CVT-SLR, to fully exploit the pretrained knowledge of both the visual and language modalities. Building on the single-cue cross-modal alignment framework, we introduce a variational autoencoder (VAE) to transfer pretrained contextual knowledge while incorporating a complete pretrained language module. The VAE implicitly aligns the visual and textual modalities while benefiting from pretrained contextual knowledge, as a traditional contextual module does. Meanwhile, a contrastive cross-modal alignment algorithm is designed to explicitly enhance the consistency constraints. Extensive experiments on public datasets (PHOENIX-2014 and PHOENIX-2014T) demonstrate that the proposed CVT-SLR consistently outperforms existing single-cue methods and even surpasses SOTA multi-cue methods.
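To make the contrastive cross-modal alignment idea concrete, below is a minimal NumPy sketch of one common form of such an objective: a symmetric InfoNCE-style loss that pulls matched visual/textual embedding pairs together and pushes mismatched pairs apart. The function name, the use of a temperature parameter, and the exact loss form are illustrative assumptions; the actual CVT-SLR alignment algorithm may differ in its details.

```python
import numpy as np

def contrastive_alignment_loss(visual, textual, temperature=0.1):
    """Symmetric InfoNCE-style contrastive loss (illustrative sketch).

    visual, textual: (N, D) arrays of paired embeddings; row i of each
    is a matched visual/textual pair, all other rows are negatives.
    """
    # L2-normalize so similarity is cosine similarity
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    t = textual / np.linalg.norm(textual, axis=1, keepdims=True)
    logits = (v @ t.T) / temperature       # pairwise similarity matrix
    n = len(v)                             # matched pairs lie on the diagonal

    def cross_entropy(l):
        # log-softmax over each row, pick out the diagonal (matched) entry
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[np.arange(n), np.arange(n)].mean()

    # average the visual->textual and textual->visual directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

A well-aligned pair of embedding sets (high diagonal similarity) yields a lower loss than a misaligned one, which is the consistency constraint the abstract refers to.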