Translating spoken languages into Sign languages is necessary for open communication between the hearing and hearing-impaired communities. To achieve this goal, we propose the first method for animating text written in HamNoSys, a lexical Sign language notation, into signed pose sequences. As HamNoSys is universal by design, our proposed method offers a generic solution invariant to the target Sign language. Our method gradually generates pose predictions using transformer encoders that create meaningful representations of the text and poses while considering their spatial and temporal information. We use weak supervision for the training process and show that our method succeeds in learning from partial and inaccurate data. Additionally, we offer a new distance measurement for pose sequences, based on DTW-MJE, that accounts for missing keypoints. We validate its correctness using AUTSL, a large-scale Sign language dataset, show that it measures the distance between pose sequences more accurately than existing measurements, and use it to assess the quality of our generated pose sequences. Code for the data pre-processing, the model, and the distance measurement is publicly released for future research.
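As a rough illustration of the masked-distance idea (not the authors' released implementation), the sketch below computes a DTW alignment whose frame-level cost is a mean joint error taken only over keypoints detected in both frames. The function names, the boolean masking convention, the zero-cost fallback when no keypoints are shared, and the path-length normalization are all our own assumptions.

```python
import numpy as np

def masked_mje(pose_a, mask_a, pose_b, mask_b):
    """Mean joint error over keypoints visible in both poses.

    pose_*: array of shape (num_keypoints, 2); mask_*: boolean array of
    shape (num_keypoints,) marking detected keypoints.
    """
    both = mask_a & mask_b
    if not both.any():
        # No shared evidence for this frame pair; contribute zero cost
        # (an illustrative design choice, not taken from the paper).
        return 0.0
    return float(np.linalg.norm(pose_a[both] - pose_b[both], axis=-1).mean())

def dtw_mje(seq_a, masks_a, seq_b, masks_b):
    """Dynamic time warping with masked MJE as the frame-level cost."""
    n, m = len(seq_a), len(seq_b)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = masked_mje(seq_a[i - 1], masks_a[i - 1],
                              seq_b[j - 1], masks_b[j - 1])
            acc[i, j] = cost + min(acc[i - 1, j],      # skip a frame in seq_a
                                   acc[i, j - 1],      # skip a frame in seq_b
                                   acc[i - 1, j - 1])  # match both frames
    # Normalize by an upper bound on the warping-path length so distances
    # are comparable across sequence lengths (again, an assumption).
    return acc[n, m] / (n + m)
```

Under these assumptions, masking keeps undetected keypoints from inflating the distance, while DTW absorbs differences in signing speed between the two sequences.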