Automatic Cued Speech Recognition (ACSR) provides an intelligent human-machine interface for visual communication, where the Cued Speech (CS) system utilizes lip movements and hand gestures to code spoken language for hearing-impaired people. Previous ACSR approaches often utilize direct feature concatenation as the main fusion paradigm. However, the asynchronous modalities in CS (\textit{i.e.}, lip, hand shape, and hand position) can interfere with such concatenation. To address this challenge, we propose a transformer-based cross-modal mutual learning framework to promote multi-modal interaction. Compared with vanilla self-attention, our model forces modality-specific information of different modalities to pass through a modality-invariant codebook, collating linguistic representations for the tokens of each modality. The shared linguistic knowledge is then used to re-synchronize the multi-modal sequences. Moreover, we establish a novel large-scale multi-speaker CS dataset for Mandarin Chinese. To our knowledge, this is the first work on ACSR for Mandarin Chinese. Extensive experiments are conducted on different languages (\textit{i.e.}, Chinese, French, and British English). Results demonstrate that our model outperforms the state-of-the-art by a large margin.
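As a rough illustration of the codebook idea described above, the sketch below shows modality-specific token sequences cross-attending to a shared, learnable modality-invariant codebook so that each modality is expressed in a common linguistic space before fusion. All names, shapes, and the single-layer design are assumptions for illustration, not the authors' implementation.

\begin{verbatim}
# Minimal sketch (assumed names/shapes, not the paper's actual architecture):
# each modality's tokens query a shared codebook via cross-attention, mapping
# asynchronous streams (lip, hand shape, hand position) into one linguistic space.
import torch
import torch.nn as nn

class CodebookCrossAttention(nn.Module):
    def __init__(self, dim=256, num_codes=64, num_heads=4):
        super().__init__()
        # Learnable modality-invariant codebook shared by all modalities (size assumed).
        self.codebook = nn.Parameter(torch.randn(num_codes, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens):
        # tokens: (batch, seq_len, dim) for one modality; seq_len may differ per stream.
        codes = self.codebook.unsqueeze(0).expand(tokens.size(0), -1, -1)
        # Modality-specific tokens attend to the shared codebook; the output lies
        # in the codebook's modality-invariant space.
        fused, _ = self.attn(query=tokens, key=codes, value=codes)
        return self.norm(tokens + fused)

# Hypothetical usage: route lip and hand streams (possibly of different lengths)
# through the same module, then fuse the re-aligned sequences downstream.
bridge = CodebookCrossAttention()
lip = torch.randn(2, 100, 256)    # lip-stream tokens
hand = torch.randn(2, 80, 256)    # hand-shape tokens (asynchronous, shorter here)
lip_aligned, hand_aligned = bridge(lip), bridge(hand)
\end{verbatim}

Because both streams are projected through the same codebook, their representations become comparable, which is what makes the subsequent re-synchronization of the asynchronous sequences feasible.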