Sign language recognition (SLR) plays a crucial role in bridging the communication gap between the hearing and vocally impaired communities and the rest of society. Word-level sign language recognition (WSLR) is the first important step towards understanding and interpreting sign language. However, recognizing signs from videos is a challenging task because the meaning of a word depends on a combination of subtle body motions, hand configurations, and other movements. Recent pose-based architectures for WSLR either model the spatial and temporal dependencies among the poses in different frames simultaneously, or model only the temporal information without fully utilizing the spatial information. We tackle the problem of WSLR using a novel pose-based approach, which captures spatial and temporal information separately and performs late fusion. Our proposed architecture explicitly captures the spatial interactions in the video using a Graph Convolutional Network (GCN). The temporal dependencies between the frames are captured using Bidirectional Encoder Representations from Transformers (BERT). Experimental results on WLASL, a standard word-level sign language recognition dataset, show that our model significantly outperforms state-of-the-art pose-based methods, improving prediction accuracy by up to 5%.
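To make the two-stream, late-fusion idea concrete, below is a minimal PyTorch sketch, not the paper's actual implementation. The class names (`GraphConvLayer`, `PoseGCNBERT`), the number of keypoints (55), the use of a generic Transformer encoder as a stand-in for BERT, and the choice of averaging class scores for late fusion are all illustrative assumptions rather than details taken from the abstract.

```python
import torch
import torch.nn as nn


class GraphConvLayer(nn.Module):
    """One graph convolution over pose keypoints: X' = relu(A_hat @ X @ W).
    The adjacency over the skeleton graph is left learnable here (assumption)."""
    def __init__(self, in_dim, out_dim, num_joints):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim)
        self.adj = nn.Parameter(torch.eye(num_joints))  # (joints, joints)

    def forward(self, x):  # x: (batch, frames, joints, in_dim)
        x = torch.einsum('ij,btjd->btid', self.adj, x)  # mix neighboring joints
        return torch.relu(self.weight(x))


class PoseGCNBERT(nn.Module):
    """Two-stream sketch: a GCN stream for spatial structure within each frame,
    a Transformer-encoder stream (standing in for BERT) for temporal
    dependencies across frames, fused late by averaging class scores."""
    def __init__(self, num_joints=55, in_dim=2, hid=64, num_classes=100):
        super().__init__()
        # Spatial stream
        self.gcn = GraphConvLayer(in_dim, hid, num_joints)
        self.spatial_head = nn.Linear(hid * num_joints, num_classes)
        # Temporal stream: one token per frame
        self.frame_proj = nn.Linear(in_dim * num_joints, hid)
        enc_layer = nn.TransformerEncoderLayer(d_model=hid, nhead=4, batch_first=True)
        self.temporal_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.temporal_head = nn.Linear(hid, num_classes)

    def forward(self, poses):  # poses: (batch, frames, joints, 2) keypoint coords
        # Spatial stream: GCN per frame, then average over time.
        spatial = self.gcn(poses).flatten(2).mean(dim=1)
        spatial_logits = self.spatial_head(spatial)
        # Temporal stream: flatten each frame's pose into a token sequence.
        tokens = self.frame_proj(poses.flatten(2))
        temporal = self.temporal_encoder(tokens).mean(dim=1)
        temporal_logits = self.temporal_head(temporal)
        # Late fusion: average the class scores of the two streams.
        return (spatial_logits + temporal_logits) / 2


# Example: a batch of 4 clips, 32 frames, 55 keypoints with (x, y) coordinates.
model = PoseGCNBERT()
logits = model(torch.randn(4, 32, 55, 2))
print(logits.shape)  # torch.Size([4, 100])
```

The point of the sketch is the separation of concerns: the GCN only ever mixes information across joints within a frame, the temporal encoder only ever mixes information across frames, and the two views of the video are combined only at the classification stage.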