The ultimate goal of continuous sign language recognition (CSLR) is to facilitate communication between deaf or hard-of-hearing people and hearing people, which requires models that run in real time and are practical to deploy. However, previous research on CSLR has paid little attention to real-time performance and deployability. To improve both, this paper proposes a zero-parameter, zero-computation temporal superposition crossover module (TSCM) and combines it with 2D convolution to form a "TSCM+2D convolution" hybrid convolution. This gives 2D convolution strong spatial-temporal modelling capability with no increase in parameters and a lower deployment cost than other spatial-temporal convolutions. The overall TSCM-based CSLR model is built on the improved ResBlockT network proposed in this paper: the "TSCM+2D convolution" hybrid convolution is applied to the ResBlock of ResNet to form the new ResBlockT, and random gradient stop and a multi-level connectionist temporal classification (CTC) loss are introduced to train the model. This reduces the final recognition word error rate (WER) while also reducing training memory usage, and it extends ResNet from image classification to video recognition. In addition, this study is the first in CSLR to extract spatial-temporal features from sign language video using only 2D convolution, with end-to-end learning for recognition. Experiments on two large-scale continuous sign language datasets demonstrate the effectiveness of the proposed method, which achieves highly competitive results.
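The abstract does not describe how TSCM works internally, so the following is only a minimal sketch of the general idea of a parameter-free temporal operation placed in front of a 2D convolution. It assumes a generic temporal channel shift as a stand-in, not the paper's actual TSCM. The function name `temporal_shift`, the `fold_div` parameter, and the toy data are all hypothetical, and spatial dimensions are omitted for brevity.

```python
def temporal_shift(clip, fold_div=4, pad=0):
    """Move a fraction of channels one step along the time axis.

    Stand-in for a parameter-free temporal module (NOT the paper's TSCM):
    it only re-indexes values, so it has no weights and does no arithmetic.

    clip: list of T frames, each a list of C per-channel values
          (spatial dimensions are omitted to keep the sketch short).
    """
    num_frames = len(clip)
    num_channels = len(clip[0])
    fold = num_channels // fold_div  # channels moved in each direction
    shifted = []
    for t in range(num_frames):
        frame = []
        for c in range(num_channels):
            if c < fold:
                src = t - 1   # first group: value from the previous frame
            elif c < 2 * fold:
                src = t + 1   # second group: value from the next frame
            else:
                src = t       # remaining channels stay in place
            # Frames outside the clip are replaced by the padding value.
            frame.append(clip[src][c] if 0 <= src < num_frames else pad)
        shifted.append(frame)
    return shifted


# Toy clip: 3 frames, 4 channels; value = 10 * frame_index + channel_index.
clip = [[10 * t + c for c in range(4)] for t in range(3)]
shifted = temporal_shift(clip, fold_div=4)
for frame in shifted:
    print(frame)
```

After such a step, each frame holds some channels from its neighbours, so an ordinary per-frame 2D convolution applied afterwards mixes information across time without any extra parameters. Whether TSCM mixes frames in this way is an assumption here.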