Owing to the superiority of the Transformer in learning long-term dependencies, sign language Transformer models have achieved remarkable progress in Sign Language Recognition (SLR) and Translation (SLT). However, several issues with the Transformer prevent it from better sign language understanding. First, the self-attention mechanism learns sign video representations in a frame-wise manner, neglecting the temporal semantic structure of sign gestures. Second, the attention mechanism with absolute position encoding is direction- and distance-unaware, thus limiting its ability. To address these issues, we propose a new model architecture, namely PiSLTRc, with two distinctive characteristics: (i) content-aware and position-aware convolution layers. Specifically, we explicitly select relevant features using a novel content-aware neighborhood gathering method, and then aggregate these features with position-informed temporal convolution layers, generating robust neighborhood-enhanced sign representations. (ii) injecting relative position information into the attention mechanism in the encoder, the decoder, and even the encoder-decoder cross-attention. Compared with the vanilla Transformer model, our model performs consistently better on three large-scale sign language benchmarks: PHOENIX-2014, PHOENIX-2014-T and CSL. Furthermore, extensive experiments demonstrate that the proposed method achieves state-of-the-art translation quality, with a $+1.6$ BLEU improvement.
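To make the direction- and distance-aware attention of point (ii) concrete, the following is a minimal sketch of scaled dot-product attention augmented with a learned relative-position bias. The function name, the bias-vector parameterization, and the distance clipping are illustrative assumptions for exposition, not the paper's exact formulation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def relative_attention(q, k, v, rel_bias, max_dist):
    """Attention with a relative-position bias (illustrative sketch).

    rel_bias: vector of length 2*max_dist + 1; entry (d + max_dist) is the
    bias added for a key at signed offset d from the query, with offsets
    clipped to [-max_dist, max_dist]. Because the bias depends on the signed
    offset, the scores become direction- and distance-aware, unlike
    attention with absolute position encoding alone.
    """
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)          # (T, T) content-based scores
    idx = np.arange(T)
    # Signed offset (key position - query position), clipped and shifted
    # so it indexes into rel_bias.
    rel = np.clip(idx[None, :] - idx[:, None], -max_dist, max_dist) + max_dist
    scores = scores + rel_bias[rel]        # add position-dependent bias
    return softmax(scores, axis=-1) @ v
```

With a zero bias vector this reduces to vanilla scaled dot-product attention; a non-zero, asymmetric bias lets the model weight frames differently depending on whether they precede or follow the current frame, and by how far.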