Hand and face play an important role in expressing sign language, and their features are often specifically leveraged to improve system performance. However, to effectively extract visual representations and capture trajectories for hands and face, previous methods usually incur high computational costs and increased training complexity. They typically employ extra heavy pose-estimation networks to locate human body keypoints, or rely on additional pre-extracted heatmaps for supervision. To alleviate this problem, we propose a self-emphasizing network (SEN) that emphasizes informative spatial regions in a self-motivated way, with little extra computation and without additional expensive supervision. Specifically, SEN first employs a lightweight subnetwork to incorporate local spatial-temporal features and identify informative regions, and then dynamically augments the original features via attention maps. We also observe that not all frames contribute equally to recognition. We therefore present a temporal self-emphasizing module to adaptively emphasize discriminative frames and suppress redundant ones. A comprehensive comparison with previous methods that exploit hand and face features demonstrates the superiority of our method, despite their much higher computational cost and reliance on expensive extra supervision. Remarkably, with little extra computation, SEN achieves new state-of-the-art accuracy on four large-scale datasets: PHOENIX14, PHOENIX14-T, CSL-Daily, and CSL. Visualizations verify the effectiveness of SEN in emphasizing informative spatial and temporal features. Code is available at https://github.com/hulianyuyy/SEN_CSLR
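To make the two modules described above concrete, the following is a minimal PyTorch sketch of how a lightweight spatial self-emphasizing block and a temporal self-emphasizing block could re-weight a video feature map. All module names, layer choices, and kernel sizes here are illustrative assumptions rather than the authors' implementation; the official code in the repository above is the authoritative reference.

```python
# Hypothetical sketch, not the released SEN implementation.
import torch
import torch.nn as nn


class SpatialSelfEmphasizing(nn.Module):
    """Lightweight subnetwork that aggregates local spatial-temporal context
    into an attention map, then re-weights the input features with it."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        hidden = max(channels // reduction, 8)
        # Cheap convolutions over (time, height, width) to gather local context.
        self.attn = nn.Sequential(
            nn.Conv3d(channels, hidden, kernel_size=1),
            nn.Conv3d(hidden, hidden, kernel_size=3, padding=1, groups=hidden),
            nn.Conv3d(hidden, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):                      # x: (B, C, T, H, W)
        a = self.attn(x)                       # (B, 1, T, H, W) attention map
        return x * (1.0 + a)                   # emphasize informative regions


class TemporalSelfEmphasizing(nn.Module):
    """Scores each frame from spatially pooled features and re-weights frames,
    emphasizing discriminative ones and suppressing redundant ones."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        hidden = max(channels // reduction, 8)
        self.score = nn.Sequential(
            nn.Conv1d(channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(hidden, 1, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, x):                      # x: (B, C, T, H, W)
        pooled = x.mean(dim=(3, 4))            # (B, C, T) global spatial pooling
        w = self.score(pooled)                 # (B, 1, T) per-frame weights
        w = w.unsqueeze(-1).unsqueeze(-1)      # (B, 1, T, 1, 1)
        return x * (1.0 + w)                   # frame-wise re-weighting


if __name__ == "__main__":
    feats = torch.randn(2, 64, 16, 28, 28)     # (batch, channels, frames, H, W)
    feats = SpatialSelfEmphasizing(64)(feats)
    feats = TemporalSelfEmphasizing(64)(feats)
    print(feats.shape)                         # torch.Size([2, 64, 16, 28, 28])
```

Both blocks keep the residual-style `x * (1 + attention)` form so that emphasis only modulates, rather than replaces, the backbone features; this is one common design choice for such self-attention re-weighting and is assumed here for illustration.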