Millions of hearing-impaired people around the world routinely use some variant of sign language to communicate, so the automatic translation of sign language is both meaningful and important. Sign Language Recognition (SLR) currently comprises two sub-problems: isolated SLR, which recognizes signs word by word, and continuous SLR, which translates entire sentences. Existing continuous SLR methods typically use isolated SLR as a building block, with an extra layer of preprocessing (temporal segmentation) and another layer of post-processing (sentence synthesis). Unfortunately, temporal segmentation is itself non-trivial and inevitably propagates errors into subsequent steps. Worse still, isolated SLR methods typically require each word in a sentence to be labeled separately, a strenuous process that severely limits the amount of attainable training data. To address these challenges, we propose a novel continuous sign recognition framework, the Hierarchical Attention Network with Latent Space (LS-HAN), which eliminates the temporal-segmentation preprocessing step. The proposed LS-HAN consists of three components: a two-stream Convolutional Neural Network (CNN) that generates video feature representations, a Latent Space (LS) that bridges the semantic gap, and a Hierarchical Attention Network (HAN) that performs recognition in the latent space. Experiments on two large-scale datasets demonstrate the effectiveness of the proposed framework.
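To make the hierarchical attention idea concrete, the following is a minimal numpy sketch (not the paper's implementation; the attention query vectors `w_frame` and `w_clip` are hypothetical stand-ins for learned parameters): frame features within each clip are pooled by a first attention layer, and the resulting clip vectors are pooled by a second attention layer into a single video-level representation for latent-space matching.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(H, w):
    """Attention pooling: H is (n, d) hidden states, w a (d,) query.

    Returns the attention-weighted sum of the rows of H, shape (d,).
    """
    scores = softmax(H @ w)   # one weight per time step
    return scores @ H

def hierarchical_attention(clips, w_frame, w_clip):
    """Two-level pooling: frames -> clip vectors -> video vector.

    clips: list of (n_i, d) frame-feature arrays, one per clip.
    """
    clip_vecs = np.stack([attend(C, w_frame) for C in clips])  # (m, d)
    return attend(clip_vecs, w_clip)                           # (d,)

rng = np.random.default_rng(0)
d = 8
clips = [rng.standard_normal((5, d)) for _ in range(3)]
video_vec = hierarchical_attention(
    clips, rng.standard_normal(d), rng.standard_normal(d)
)
print(video_vec.shape)  # (8,)
```

In the full model the queries would be trained jointly with the CNN features, and the video-level vector would be compared with sentence embeddings in the shared latent space; this sketch only shows the two-level attention pooling structure.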