Sign languages are visual languages that use manual articulations and non-manual elements to convey information. For sign language recognition and translation, the majority of existing approaches directly encode RGB videos into hidden representations. RGB videos, however, are raw signals with substantial visual redundancy, causing the encoder to overlook information that is key to sign language understanding. To mitigate this problem and better incorporate domain knowledge, such as handshape and body movement, we introduce a dual visual encoder containing two separate streams that model both the raw videos and the keypoint sequences generated by an off-the-shelf keypoint estimator. To make the two streams interact with each other, we explore a variety of techniques, including bidirectional lateral connections, a sign pyramid network with auxiliary supervision, and frame-level self-distillation. The resulting model, called TwoStream-SLR, performs sign language recognition (SLR). TwoStream-SLR is extended to a sign language translation (SLT) model, TwoStream-SLT, by simply attaching an extra translation network. Experimentally, our TwoStream-SLR and TwoStream-SLT achieve state-of-the-art performance on SLR and SLT tasks across a series of datasets, including Phoenix-2014, Phoenix-2014T, and CSL-Daily.
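To make the dual-encoder idea concrete, here is a minimal PyTorch sketch of two streams (RGB-frame features and keypoint features) coupled by one bidirectional lateral connection. All names, dimensions, and layer choices in it (TwoStreamEncoderSketch, video_dim=512, 1x1-convolution fusion, and so on) are illustrative assumptions for exposition, not the paper's actual configuration; per-frame features are assumed to be pre-extracted, and the sign pyramid network and self-distillation components are omitted.

import torch
import torch.nn as nn

class BidirectionalLateralConnection(nn.Module):
    """Exchanges information between the two streams via 1x1 convolutions.
    This is a hypothetical simplification of a bidirectional lateral
    connection; the paper's exact fusion design may differ."""
    def __init__(self, dim: int):
        super().__init__()
        self.kp_to_video = nn.Conv1d(dim, dim, kernel_size=1)
        self.video_to_kp = nn.Conv1d(dim, dim, kernel_size=1)

    def forward(self, video_feat, kp_feat):
        # Features are (batch, dim, time); each stream receives a
        # projected copy of the other stream's features.
        fused_video = video_feat + self.kp_to_video(kp_feat)
        fused_kp = kp_feat + self.video_to_kp(video_feat)
        return fused_video, fused_kp

class TwoStreamEncoderSketch(nn.Module):
    """Minimal dual visual encoder sketch: one temporal stream over RGB
    frame features, one over keypoint features, joined by a lateral
    connection and concatenated per frame for downstream SLR/SLT heads."""
    def __init__(self, video_dim: int = 512, kp_dim: int = 128, hidden: int = 256):
        super().__init__()
        self.video_stem = nn.Conv1d(video_dim, hidden, kernel_size=3, padding=1)
        self.kp_stem = nn.Conv1d(kp_dim, hidden, kernel_size=3, padding=1)
        self.lateral = BidirectionalLateralConnection(hidden)
        self.head = nn.Linear(2 * hidden, hidden)

    def forward(self, video_feat, kp_feat):
        v = torch.relu(self.video_stem(video_feat))  # (B, hidden, T)
        k = torch.relu(self.kp_stem(kp_feat))        # (B, hidden, T)
        v, k = self.lateral(v, k)
        # Concatenate the two streams channel-wise, then move time to
        # dim 1 so the linear head produces one vector per frame.
        joint = torch.cat([v, k], dim=1).transpose(1, 2)  # (B, T, 2*hidden)
        return self.head(joint)                           # (B, T, hidden)

# Usage: a batch of 2 clips, 16 frames each, with assumed pre-extracted
# 512-d RGB features and 128-d keypoint features per frame.
video_feat = torch.randn(2, 512, 16)
kp_feat = torch.randn(2, 128, 16)
out = TwoStreamEncoderSketch()(video_feat, kp_feat)  # shape: (2, 16, 256)

The frame-level outputs of such an encoder could then feed a CTC-style recognition head for SLR, or an attached translation network for SLT, as the abstract describes.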