This paper describes Microsoft's submission to the first shared task on sign language translation at WMT 2022, a public competition on translating Swiss German Sign Language into spoken language. The task is very challenging because of data scarcity and an unprecedented target-side vocabulary of more than 20k words. Moreover, the data is taken from real broadcast news, includes native signing, and contains long videos. Motivated by recent advances in action recognition, we incorporate full-body information by extracting features from a pre-trained I3D model and feeding them to a standard transformer network. Careful cleaning of the target text further improves the accuracy of the system. We obtain BLEU scores of 0.6 and 0.78 on the test and dev sets, respectively, the best scores among the participants of the shared task. The submission also ranks first in the human evaluation. Adding features extracted from a lip reading model further improves the BLEU score to 1.08 on the dev set.