Sign Language (SL), the mother tongue of the deaf community, is a distinct visual language that most hearing people cannot understand. In recent years, neural Sign Language Translation (SLT) has attracted widespread academic attention as a possible way to bridge the communication gap between deaf and hearing people. We find that current mainstream end-to-end neural SLT models, which try to learn language knowledge in a weakly supervised manner, cannot mine sufficient semantic information under low-resource data conditions. Therefore, we propose to introduce additional word-level semantic knowledge from sign language linguistics to improve current end-to-end neural SLT models. Concretely, we propose a novel neural SLT model with multi-modal feature fusion based on a dynamic graph: the cross-modal information, i.e., text and video, is first assembled into a dynamic graph according to its correlation, and the graph is then processed by a multi-modal graph encoder to generate multi-modal embeddings for use in the subsequent neural translation model. To the best of our knowledge, we are the first to introduce graph neural networks for fusing multi-modal information into neural sign language translation models. We conducted experiments on the popular, publicly available SLT dataset RWTH-PHOENIX-Weather-2014T, and the quantitative results show that our method improves translation performance.
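To make the fusion idea concrete, the following is a minimal PyTorch sketch of how cross-modal features might be assembled into a dynamic graph and passed through a graph encoder. The class name `DynamicGraphFusion`, the similarity threshold, and all dimensions are illustrative assumptions for exposition, not the paper's actual implementation.

```python
# A minimal, illustrative sketch of dynamic-graph multi-modal fusion.
# All names, dimensions, and thresholds here are hypothetical assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicGraphFusion(nn.Module):
    """Fuse word-level text embeddings and video frame features via a dynamic graph.

    Nodes are the concatenation of text tokens and video frames; edges are
    created dynamically from pairwise cosine similarity (correlation), then a
    single graph-convolution step mixes information across modalities.
    """

    def __init__(self, text_dim: int, video_dim: int, hidden_dim: int, threshold: float = 0.5):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden_dim)    # project text into a shared space
        self.video_proj = nn.Linear(video_dim, hidden_dim)  # project video into the same space
        self.gcn_weight = nn.Linear(hidden_dim, hidden_dim)
        self.threshold = threshold

    def forward(self, text_feats: torch.Tensor, video_feats: torch.Tensor) -> torch.Tensor:
        # text_feats: (T_text, text_dim), video_feats: (T_video, video_dim)
        nodes = torch.cat([self.text_proj(text_feats), self.video_proj(video_feats)], dim=0)

        # Build the dynamic adjacency from pairwise cosine similarity between nodes,
        # keeping only edges whose correlation exceeds a threshold. Self-loops are
        # implicit because each node's self-similarity is 1.
        normed = F.normalize(nodes, dim=-1)
        sim = normed @ normed.t()                           # (N, N) correlation matrix
        adj = (sim > self.threshold).float()

        # Symmetric normalization D^{-1/2} A D^{-1/2}, as in a standard GCN layer.
        deg = adj.sum(dim=-1)
        d_inv_sqrt = deg.clamp(min=1e-6).pow(-0.5)
        adj_norm = d_inv_sqrt.unsqueeze(1) * adj * d_inv_sqrt.unsqueeze(0)

        # One message-passing step yields multi-modal node embeddings that a
        # downstream translation encoder-decoder could consume.
        return F.relu(self.gcn_weight(adj_norm @ nodes))


if __name__ == "__main__":
    fusion = DynamicGraphFusion(text_dim=300, video_dim=1024, hidden_dim=512)
    gloss_emb = torch.randn(12, 300)    # e.g. word-level gloss embeddings
    frame_emb = torch.randn(80, 1024)   # e.g. per-frame visual features
    fused = fusion(gloss_emb, frame_emb)
    print(fused.shape)                  # torch.Size([92, 512])
```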