Sign language is used by deaf or speech-impaired people to communicate, and it requires great effort to master. Sign Language Recognition (SLR) aims to bridge the gap between sign language users and others by recognizing words from given videos. It is an important yet challenging task, since sign language is performed with fast and complex movements of hand gestures, body posture, and even facial expressions. Recently, skeleton-based action recognition has attracted increasing attention due to its independence from subjects and background variations. It is also a strong complement to RGB/D modalities that can further boost the overall recognition rate. However, skeleton-based SLR is still under exploration due to the lack of annotations on hand keypoints. Some efforts have been made to combine hand detectors with pose estimators to extract hand keypoints and to recognize sign language via a Recurrent Neural Network, but none of them outperforms RGB-based methods. To this end, we propose a novel skeleton-based SLR approach using whole-body keypoints, together with a universal multi-modal SLR framework (Uni-SLR), to further improve the recognition rate. Specifically, we propose a Graph Convolution Network (GCN) to model the embedded spatial relations and dynamic motions, and a novel Separable Spatial-Temporal Convolution Network (SSTCN) to exploit skeleton features. Our skeleton-based method achieves a higher recognition rate than all other single modalities. Moreover, our proposed Uni-SLR framework further enhances the performance by ensembling our skeleton-based method with the RGB and depth modalities. As a result, our Uni-SLR framework achieves the highest performance in both the RGB (98.42\%) and RGB-D (98.53\%) tracks of the 2021 Looking at People Large Scale Signer Independent Isolated SLR Challenge. Our code will be provided at \url{https://github.com/jackyjsy/CVPR21Chal-SLR}.
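As a rough illustration of the multi-modal ensembling idea described above, the minimal sketch below fuses per-modality class scores with a weighted sum; the modality names, weights, and array sizes are hypothetical placeholders and do not reflect the actual configuration used in our framework.

\begin{verbatim}
import numpy as np

# Hypothetical per-modality class-score arrays (num_samples x num_classes),
# e.g. softmax outputs of the skeleton (GCN/SSTCN), RGB, and depth streams.
num_samples, num_classes = 4, 10   # placeholder sizes, not the real dataset
rng = np.random.default_rng(0)
scores = {
    "skeleton": rng.random((num_samples, num_classes)),
    "rgb":      rng.random((num_samples, num_classes)),
    "depth":    rng.random((num_samples, num_classes)),
}

# Hypothetical fusion weights; in practice they would be tuned on validation data.
weights = {"skeleton": 1.0, "rgb": 0.9, "depth": 0.4}

# Late fusion: weighted sum of per-modality scores, then argmax per sample.
fused = sum(w * scores[m] for m, w in weights.items())
predictions = fused.argmax(axis=1)
print(predictions)
\end{verbatim}

Score-level fusion of this kind keeps each modality's network independent, so the skeleton stream can be trained and evaluated on its own and then combined with the RGB and depth predictions.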