Sign Language Translation (SLT) is a promising technology for bridging the communication gap between deaf and hearing people. Recently, researchers have adopted Neural Machine Translation (NMT) methods, which usually require large-scale corpora for training, to achieve SLT. However, publicly available SLT corpora are very limited, which causes token representations to collapse and the generated tokens to be inaccurate. To alleviate this issue, we propose ConSLT, a novel token-level \textbf{Con}trastive learning framework for \textbf{S}ign \textbf{L}anguage \textbf{T}ranslation, which learns effective token representations by incorporating token-level contrastive learning into the SLT decoding process. Concretely, during decoding, ConSLT treats each token and its counterpart generated under a different dropout mask as a positive pair, and then randomly samples $K$ tokens from the vocabulary that do not appear in the current sentence to construct negative examples. We conduct comprehensive experiments on two benchmarks (PHOENIX14T and CSL-Daily) under both end-to-end and cascaded settings. The experimental results demonstrate that ConSLT achieves better translation quality than strong baselines.
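To make the objective concrete, below is a minimal PyTorch sketch of such a token-level contrastive loss. It assumes an InfoNCE-style formulation with cosine similarity, uses the output embedding matrix to represent negative tokens, and introduces illustrative hyperparameters $K$ and $\tau$; the function name and these choices are assumptions for illustration, not the paper's exact implementation.

\begin{verbatim}
import torch
import torch.nn.functional as F

def token_contrastive_loss(h1, h2, target_ids, embedding,
                           vocab_size, K=500, tau=0.1):
    """Sketch of a token-level contrastive loss (assumed InfoNCE form).

    h1, h2:     (T, d) decoder hidden states for the same target sentence,
                produced by two forward passes with different dropout masks,
                so each (h1_t, h2_t) forms a positive pair.
    target_ids: (T,) token ids of the current target sentence.
    embedding:  (V, d) output token embedding matrix, used here to
                represent vocabulary tokens (an assumption of this sketch).
    """
    T, _ = h1.shape
    device = h1.device

    # Sample K negative token ids from the vocabulary, excluding any
    # token that appears in the current sentence.
    mask = torch.ones(vocab_size, dtype=torch.bool, device=device)
    mask[target_ids] = False
    candidates = mask.nonzero(as_tuple=True)[0]
    neg_ids = candidates[torch.randperm(candidates.numel(),
                                        device=device)[:K]]
    neg_emb = embedding[neg_ids]                                 # (K, d)

    # Temperature-scaled cosine similarities for positives and negatives.
    pos_sim = F.cosine_similarity(h1, h2, dim=-1) / tau          # (T,)
    neg_sim = F.cosine_similarity(h1.unsqueeze(1),
                                  neg_emb.unsqueeze(0),
                                  dim=-1) / tau                  # (T, K)

    # InfoNCE-style cross-entropy: the positive sits at index 0.
    logits = torch.cat([pos_sim.unsqueeze(1), neg_sim], dim=1)   # (T, K+1)
    labels = torch.zeros(T, dtype=torch.long, device=device)
    return F.cross_entropy(logits, labels)
\end{verbatim}

In use, h1 and h2 would come from two forward passes of the decoder on the same input so that the only difference between them is the dropout noise, in the spirit of SimCSE-style dropout augmentation; this contrastive term would then be added to the standard translation loss.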