Hand gesture plays a critical role in sign language. Current deep-learning-based sign language recognition (SLR) methods may suffer from insufficient interpretability and overfitting due to limited sign data sources. In this paper, we introduce SignBERT, the first self-supervised pre-trainable framework with incorporated hand prior for SLR. SignBERT views the hand pose as a visual token, derived from an off-the-shelf pose extractor. The visual tokens are then embedded with gesture state, temporal, and hand chirality information. To take full advantage of available sign data sources, SignBERT first performs self-supervised pre-training by masking and reconstructing visual tokens. Jointly with several mask modeling strategies, we attempt to incorporate the hand prior in a model-aware method to better capture hierarchical context over the hand sequence. Then, with a prediction head added, SignBERT is fine-tuned to perform the downstream SLR task. To validate the effectiveness of our method on SLR, we perform extensive experiments on four public benchmark datasets, i.e., NMFs-CSL, SLR500, MSASL, and WLASL. Experimental results demonstrate the effectiveness of both self-supervised learning and the incorporated hand prior. Furthermore, we achieve state-of-the-art performance on all benchmarks with a notable gain.
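To make the pre-training objective concrete, below is a minimal sketch of masked modeling over hand pose tokens. It is not the authors' implementation: all class and parameter names (e.g., MaskedPoseModeling, mask_ratio) are hypothetical, the encoder is a plain transformer, and the model-aware hand-prior decoder and chirality/gesture-state embeddings described in the abstract are omitted for brevity.

```python
# Minimal sketch of masked pose-token pre-training (hypothetical names,
# not the paper's actual architecture). A hand pose sequence is treated
# as a series of visual tokens; a random subset is masked, and a
# transformer encoder is trained to reconstruct the masked tokens.
import torch
import torch.nn as nn

class MaskedPoseModeling(nn.Module):
    def __init__(self, num_joints=21, dim=256, depth=4, heads=8, max_len=512):
        super().__init__()
        token_dim = num_joints * 2          # 2D joint coordinates per frame
        self.embed = nn.Linear(token_dim, dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Learned temporal position embedding for the token sequence.
        self.pos_embed = nn.Parameter(torch.zeros(1, max_len, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, token_dim)  # reconstruct joint coordinates

    def forward(self, poses, mask_ratio=0.5):
        # poses: (B, T, num_joints, 2), e.g., from an off-the-shelf pose extractor
        B, T = poses.shape[:2]
        tokens = self.embed(poses.flatten(2))            # (B, T, dim)
        # Randomly select frames to mask and replace them with the mask token.
        mask = torch.rand(B, T, device=poses.device) < mask_ratio
        tokens = torch.where(mask.unsqueeze(-1),
                             self.mask_token.expand(B, T, -1), tokens)
        tokens = tokens + self.pos_embed[:, :T]
        recon = self.head(self.encoder(tokens))          # (B, T, token_dim)
        target = poses.flatten(2)
        # Reconstruction loss is computed only on the masked frames.
        return ((recon - target) ** 2)[mask].mean()

# Usage: pre-train on unlabeled pose sequences, then replace the
# reconstruction head with a classification head for SLR fine-tuning.
model = MaskedPoseModeling()
dummy = torch.randn(2, 64, 21, 2)  # batch of two 64-frame hand pose sequences
loss = model(dummy)
```

After pre-training under this objective, fine-tuning for SLR would discard the reconstruction head and attach a prediction head over the encoded sequence, as the abstract describes.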