Leveraging context information is an intuitive idea to improve performance on conversational automatic speech recognition (ASR). Previous works usually adopt recognized hypotheses of historical utterances as preceding context, which may bias the current hypothesis due to inevitable errors in the historical recognition results. To avoid this problem, we propose an audio-textual cross-modal representation extractor that learns contextual representations directly from preceding speech. Specifically, it consists of two modal-related encoders, which extract high-level latent features from speech and the corresponding text, and a cross-modal encoder, which aims to learn the correlation between speech and text. We randomly mask some input tokens and entire input sequences of each modality, and then perform token-missing or modal-missing prediction with a modal-level CTC loss on the cross-modal encoder. Thus, the model captures not only the bi-directional context dependencies within each modality but also the relationships between modalities. During training of the conversational ASR system, the extractor is frozen and used to extract the textual representation of preceding speech; this representation is then fed to the ASR decoder as context through an attention mechanism. The effectiveness of the proposed approach is validated on several Mandarin conversation corpora, and a character error rate (CER) reduction of up to 16% is achieved on the MagicData dataset.
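To make the architecture concrete, below is a minimal PyTorch sketch of the extractor as described: two modal-related encoders, a cross-modal encoder over their concatenated outputs, random token masking with a learned mask embedding, and a CTC head for the modal-level objective. All names and hyperparameters (`CrossModalExtractor`, `feat_dim`, `mask_prob`, layer sizes) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CrossModalExtractor(nn.Module):
    """Sketch: two modal-related encoders plus a cross-modal encoder,
    pre-trained with masking and a modal-level CTC loss."""

    def __init__(self, feat_dim=80, d_model=256, n_heads=4, n_layers=2,
                 vocab_size=5000, mask_prob=0.15):
        super().__init__()

        def enc():
            layer = nn.TransformerEncoderLayer(
                d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
            return nn.TransformerEncoder(layer, n_layers)

        self.speech_proj = nn.Linear(feat_dim, d_model)  # project acoustic frames
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.speech_encoder = enc()  # modal-related encoder for speech
        self.text_encoder = enc()    # modal-related encoder for text
        self.cross_encoder = enc()   # learns the speech/text correlation
        self.mask_embed = nn.Parameter(torch.zeros(d_model))  # learned [MASK]
        self.ctc_head = nn.Linear(d_model, vocab_size)
        self.mask_prob = mask_prob

    def mask_tokens(self, x):
        # Token-missing: replace random positions with the [MASK] vector.
        # (Masking an entire input sequence instead gives the modal-missing case.)
        drop = torch.rand(x.shape[:2], device=x.device).unsqueeze(-1) < self.mask_prob
        return torch.where(drop, self.mask_embed.expand_as(x), x)

    def forward(self, speech_feats, text_ids):
        s = self.speech_encoder(self.mask_tokens(self.speech_proj(speech_feats)))
        t = self.text_encoder(self.mask_tokens(self.text_embed(text_ids)))
        h = self.cross_encoder(torch.cat([s, t], dim=1))
        # Log-probs over the speech positions, for CTC against the transcript.
        ctc_logp = self.ctc_head(h[:, : s.size(1)]).log_softmax(-1)
        return h, ctc_logp

# Toy usage with random inputs (batch of 2, 120 frames, 20 text tokens).
extractor = CrossModalExtractor()
speech = torch.randn(2, 120, 80)
text = torch.randint(0, 5000, (2, 20))
h, ctc_logp = extractor(speech, text)
```

At ASR training time, the extractor would be frozen (e.g., `extractor.requires_grad_(False)`), and the representation `h` computed from the preceding utterance's speech would serve as the memory that the ASR decoder attends to through its cross-attention.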