Sign languages are visual languages which convey information through signers' handshape, facial expression, body movement, and so forth. Due to the inherent restriction on combinations of these visual ingredients, there exist a significant number of visually indistinguishable signs (VISigns) in sign languages, which limits the recognition capacity of vision neural networks. To mitigate this problem, we propose the Natural Language-Assisted Sign Language Recognition (NLA-SLR) framework, which exploits the semantic information contained in glosses (sign labels). First, for VISigns with similar semantic meanings, we propose language-aware label smoothing, which eases training by generating a soft label for each training sign whose smoothing weights are computed from the normalized semantic similarities among the glosses. Second, for VISigns with distinct semantic meanings, we present an inter-modality mixup technique which blends vision and gloss features to further maximize the separability of different signs under the supervision of blended labels. In addition, we introduce a novel backbone, the video-keypoint network, which not only models both RGB videos and human body keypoints but also derives knowledge from sign videos with different temporal receptive fields. Empirically, our method achieves state-of-the-art performance on three widely adopted benchmarks: MSASL, WLASL, and NMFs-CSL. Code is available at https://github.com/FangyunWei/SLRT.
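To make the two label-space ideas concrete, the sketch below gives a minimal, hypothetical PyTorch-style illustration of language-aware label smoothing and inter-modality mixup. It is not the authors' implementation: the function names, the smoothing weight `smooth`, the temperature `tau`, the mixing coefficient `lam`, and the choice of text encoder used to obtain gloss embeddings are all assumptions made for illustration.

```python
# Hypothetical sketch (assumptions noted above), not the released NLA-SLR code.
import torch
import torch.nn.functional as F

def language_aware_soft_labels(gloss_emb: torch.Tensor, target: torch.Tensor,
                               smooth: float = 0.2, tau: float = 0.1) -> torch.Tensor:
    """Soft labels from normalized semantic similarities among glosses.

    gloss_emb: (num_classes, dim) text embeddings of the glosses (e.g., from a
               pretrained text encoder -- an assumption in this sketch).
    target:    (batch,) ground-truth class indices.
    Returns:   (batch, num_classes) soft labels with 1 - smooth on the true class
               and the remaining `smooth` mass spread over semantically similar glosses.
    """
    emb = F.normalize(gloss_emb, dim=-1)
    sim = emb @ emb.t()                         # (C, C) cosine similarities among glosses
    sim.fill_diagonal_(float('-inf'))           # keep the smoothing mass off the true class
    weights = F.softmax(sim / tau, dim=-1)      # normalized semantic similarities per gloss
    soft = smooth * weights[target]             # (batch, C) smoothing portion
    soft.scatter_(1, target.unsqueeze(1), 1.0 - smooth)
    return soft

def inter_modality_mixup(vision_feat: torch.Tensor, gloss_feat: torch.Tensor,
                         labels_v: torch.Tensor, labels_g: torch.Tensor, lam: float):
    """Blend a vision feature with a gloss (text) feature of the same dimension,
    and supervise the blended feature with the correspondingly blended label."""
    mixed_feat = lam * vision_feat + (1.0 - lam) * gloss_feat
    mixed_label = lam * labels_v + (1.0 - lam) * labels_g   # one-hot or soft labels
    return mixed_feat, mixed_label
```

In this sketch the diagonal of the similarity matrix is masked out so that the smoothing mass is distributed only over the other glosses, and the temperature `tau` controls how strongly that mass concentrates on semantically close signs; the mixup assumes vision and gloss features have already been projected to a shared dimension.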