Sign language recognition (SLR) is a challenging problem that involves both complex manual features, i.e., hand gestures, and fine-grained non-manual features (NMFs), e.g., facial expressions and mouth shapes. Although manual features are dominant, non-manual features also play an important role in expressing a sign word. Specifically, many sign words share the same hand gestures yet convey different meanings because of their non-manual features. This ambiguity makes sign words considerably harder to recognize. To tackle this issue, we propose a simple yet effective architecture called the Global-local Enhancement Network (GLE-Net), which consists of two mutually promoting streams that target different crucial aspects of SLR: one captures the global contextual relationship, while the other captures discriminative fine-grained cues. Moreover, because no existing dataset explicitly focuses on such features, we introduce the first non-manual-features-aware isolated Chinese sign language dataset (NMFs-CSL), with a total vocabulary of 1,067 sign words used in daily life. Extensive experiments on the NMFs-CSL and SLR500 datasets demonstrate the effectiveness of our method.