In this work, we leverage the success of BERT pre-training and model domain-specific statistics to benefit the sign language recognition~(SLR) model. Considering the dominant role of the hands and body in sign language expression, we organize them as pose triplet units and feed them into the Transformer backbone in a frame-wise manner. Pre-training is performed by reconstructing the masked triplet units from the corrupted input sequence, which learns hierarchical contextual correlations both within and across triplet units. Notably, unlike the highly semantic word tokens in BERT, a pose unit is a low-level signal that lives in a continuous space, which prevents the direct adoption of the BERT cross-entropy objective. To this end, we bridge this semantic gap via coupling tokenization of the triplet unit, which adaptively extracts a discrete pseudo label representing the semantic gesture/body state from each pose triplet unit. After pre-training, we fine-tune the pre-trained encoder on the downstream SLR task, jointly with a newly added task-specific layer. Extensive experiments validate the effectiveness of the proposed method, which achieves new state-of-the-art performance on all four benchmarks with notable gains.
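To make the pre-training scheme concrete, below is a minimal PyTorch sketch of one masked-reconstruction step. All names (`TripletEncoder`, `pretrain_step`), the dimensions, the masking ratio, and the tokenizer interface are illustrative assumptions rather than the paper's implementation; the coupling tokenizer itself is left abstract as a callable that maps pose triplet units to discrete pseudo labels.

```python
# Hedged sketch: masked pre-training over frame-wise pose triplet units.
# Dimensions, masking ratio, and the tokenizer interface are assumptions.
import torch
import torch.nn as nn

class TripletEncoder(nn.Module):
    """Transformer encoder over frame-wise pose triplet units (hypothetical)."""
    def __init__(self, pose_dim=126, d_model=256, vocab_size=1024, n_layers=6):
        super().__init__()
        self.embed = nn.Linear(pose_dim, d_model)        # project raw pose to embeddings
        self.mask_token = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, vocab_size)       # predict discrete pseudo labels

    def forward(self, poses, mask):
        # poses: (B, T, pose_dim) triplet units; mask: (B, T) True where masked
        x = self.embed(poses)
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        return self.head(self.encoder(x))                # (B, T, vocab_size) logits

def pretrain_step(model, tokenizer, poses, mask_ratio=0.5):
    """One masked-reconstruction step: predict the tokenized pseudo label of
    each masked triplet unit with a cross-entropy objective.
    `tokenizer` stands in for the coupling tokenizer (e.g., a learned codebook)."""
    with torch.no_grad():
        labels = tokenizer(poses)                        # (B, T) discrete pseudo labels
    mask = torch.rand(poses.shape[:2], device=poses.device) < mask_ratio
    logits = model(poses, mask)
    return nn.functional.cross_entropy(logits[mask], labels[mask])
```

The key point the sketch illustrates is that the continuous pose signal is never regressed directly; the cross-entropy target is the discrete pseudo label supplied by the tokenizer.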
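The fine-tuning stage can be sketched in the same hedged spirit: the pre-trained encoder is reused and a newly added task-specific classification layer is trained jointly with it. The gloss vocabulary size `num_glosses` and the average-pooling readout below are assumptions, not the paper's choices.

```python
# Hedged sketch: fine-tuning the pre-trained encoder for downstream SLR.
# Reuses TripletEncoder from the sketch above; head and pooling are assumptions.
class SLRClassifier(nn.Module):
    """Pre-trained encoder plus a newly added gloss-classification head."""
    def __init__(self, pretrained: TripletEncoder, num_glosses=2000, d_model=256):
        super().__init__()
        self.backbone = pretrained                        # keep pre-trained weights
        self.cls_head = nn.Linear(d_model, num_glosses)   # new task-specific layer

    def forward(self, poses):
        # Encode without masking, then pool over time for sequence-level recognition.
        feats = self.backbone.encoder(self.backbone.embed(poses))
        return self.cls_head(feats.mean(dim=1))           # temporal average pooling
```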