Sign language recognition and translation first uses a recognition module to generate glosses from sign language videos, and then employs a translation module to translate the glosses into spoken-language sentences. Most existing works focus on the recognition step, while paying less attention to sign language translation. In this work, we propose a task-aware instruction network, namely TIN-SLT, for sign language translation, which introduces an instruction module and a learning-based feature fusion strategy into a Transformer network. In this way, the language ability of the pre-trained model can be fully explored and utilized to further boost translation performance. Moreover, by exploring the representation space of sign language glosses and the target spoken language, we propose a multi-level data augmentation scheme to adjust the data distribution of the training set. We conduct extensive experiments on two challenging benchmark datasets, PHOENIX-2014-T and ASLG-PC12, on which our method outperforms the previous best solutions by 1.65 and 1.42 BLEU-4 points, respectively. Our code is published at https://github.com/yongcaoplus/TIN-SLT.
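To make the "learning-based feature fusion strategy" concrete, the sketch below shows one common way such a fusion can be realized: a gated combination of pre-trained language-model features and task-specific encoder features. This is only an illustrative assumption; the class name `GatedFusion`, the gating formulation, and the tensor shapes are hypothetical and are not taken from the TIN-SLT implementation.

```python
# Illustrative sketch (assumption): a learned gate that fuses pre-trained
# LM features with task-specific features, per dimension. Not the authors' exact design.
import torch
import torch.nn as nn


class GatedFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # Maps the concatenation of both feature streams to a per-dimension gate.
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, pretrained_feat: torch.Tensor, task_feat: torch.Tensor) -> torch.Tensor:
        # g in (0, 1) decides, per dimension, how much of each source to keep.
        g = torch.sigmoid(self.gate(torch.cat([pretrained_feat, task_feat], dim=-1)))
        return g * pretrained_feat + (1.0 - g) * task_feat


# Usage example: fuse two 512-d token representations for a batch of 8 sequences of length 20.
fusion = GatedFusion(dim=512)
fused = fusion(torch.randn(8, 20, 512), torch.randn(8, 20, 512))
print(fused.shape)  # torch.Size([8, 20, 512])
```

A learned gate of this kind lets the model decide, token by token, whether to rely more on the pre-trained model's general language knowledge or on the task-specific gloss features, which is one plausible reading of how a pre-trained model's language ability could be exploited in the translation module.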