以跨模式变形器为基础的自动语音识别神经校正模型 (Cross-Modal Transformer-Based Neural Correction Models for Automatic Speech Recognition) - 专知论文

会员服务 ·

0

语音识别 · MoDELS · 自动语音识别 · 模态 · Networks ·

2021 年 7 月 4 日

Cross-Modal Transformer-Based Neural Correction Models for Automatic Speech Recognition

翻译：以跨模式变形器为基础的自动语音识别神经校正模型

Tomohiro Tanaka,Ryo Masumura,Mana Ihori,Akihiko Takashima,Takafumi Moriya,Takanori Ashihara,Shota Orihashi,Naoki Makishima

from arxiv, Accepted to Interspeech 2021

We propose a cross-modal transformer-based neural correction models that refines the output of an automatic speech recognition (ASR) system so as to exclude ASR errors. Generally, neural correction models are composed of encoder-decoder networks, which can directly model sequence-to-sequence mapping problems. The most successful method is to use both input speech and its ASR output text as the input contexts for the encoder-decoder networks. However, the conventional method cannot take into account the relationships between these two different modal inputs because the input contexts are separately encoded for each modal. To effectively leverage the correlated information between the two different modal inputs, our proposed models encode two different contexts jointly on the basis of cross-modal self-attention using a transformer. We expect that cross-modal self-attention can effectively capture the relationships between two different modals for refining ASR hypotheses. We also introduce a shallow fusion technique to efficiently integrate the first-pass ASR model and our proposed neural correction model. Experiments on Japanese natural language ASR tasks demonstrated that our proposed models achieve better ASR performance than conventional neural correction models.

翻译：我们建议一种基于跨模式变压器的神经校正模型,改进自动语音识别系统(ASR)的输出,从而排除ASR错误。一般来说,神经校正模型由编码器-解码器网络组成,可以直接模拟序列到序列的绘图问题。最成功的方法是使用输入语音及其ASR输出文本作为编码器-解码器网络的输入背景。然而,常规方法不能考虑到这两个不同的模式输入输入输入输入器之间的关系,因为输入环境对每个模式分别进行了编码。为了有效地利用两种不同的模式输入之间的关联信息,我们提议的模型在使用变压器的跨模式自我注意的基础上将两种不同的环境共同编码。我们期望,交叉模式自我注意能够有效地捕捉两种不同的模型之间的关系,用于改进ASR假设。我们还采用了一种浅的融合技术,以便有效地将第一流的ASR模型和我们提议的神经校正模型结合起来。在日本的自然语言上实验ASR任务表明,我们提议的模型比常规的校正模型更好地实现了ASR性。

0

相关内容

语音识别

语音识别是计算机科学和计算语言学的一个跨学科子领域，它发展了一些方法和技术，使计算机可以将口语识别和翻译成文本。它也被称为自动语音识别（ASR），计算机语音识别或语音转文本（STT）。它整合了计算机科学，语言学和计算机工程领域的知识和研究。

纽约大学最新《语音识别Speech Recognition》2020课程，不可错过！

纽约大学最新《语音识别Speech Recognition》2020课程，不可错过！

专知会员服务

44+阅读 · 2020年11月2日

【ACL2020-亚马逊】Transformers多分辨率和多模态语音识别，Multiresolution and Multimodal Speech Recognition with Transformers

【ACL2020-亚马逊】Transformers多分辨率和多模态语音识别，Multiresolution and Multimodal Speech Recognition with Transformers

专知会员服务

15+阅读 · 2020年5月5日

【哈佛-ICLR2020】基于残差能量模型的文本生成，Residual Energy-Based Models for Text Generation

【哈佛-ICLR2020】基于残差能量模型的文本生成，Residual Energy-Based Models for Text Generation

专知会员服务

11+阅读 · 2020年4月27日

Transformer文本分类代码

Transformer文本分类代码

专知会员服务

118+阅读 · 2020年2月3日

【Yoshua Bengio新论文】多任务自监督学习语音识别，MULTI-TASK SELF-SUPERVISED LEARNING FOR ROBUST SPEECH RECOGNITION

【Yoshua Bengio新论文】多任务自监督学习语音识别，MULTI-TASK SELF-SUPERVISED LEARNING FOR ROBUST SPEECH RECOGNITION

专知会员服务

39+阅读 · 2020年1月30日

【中科院自动化所】序列到序列语音识别的无监督预训练（Unsupervised pre-training for sequence to sequence speech recognition）

【中科院自动化所】序列到序列语音识别的无监督预训练（Unsupervised pre-training for sequence to sequence speech recognition）

专知会员服务

33+阅读 · 2020年1月5日

【论文】使用编码器进行命名实体识别（TENER: Adapting Transformer Encoder for Named Entity Recognition）

【论文】使用编码器进行命名实体识别（TENER: Adapting Transformer Encoder for Named Entity Recognition）

专知会员服务

52+阅读 · 2019年12月28日

【AAAI2020】多模态注意力语义图嵌入多标签分类（Cross-Modality Attention with Semantic Graph Embedding for Multi-Label Classification）

【AAAI2020】多模态注意力语义图嵌入多标签分类（Cross-Modality Attention with Semantic Graph Embedding for Multi-Label Classification）

专知会员服务

92+阅读 · 2019年12月22日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

49+阅读 · 2019年10月17日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

已删除

将门创投

5+阅读 · 2019年5月5日

CCF C类 | IJCNN 2019 Special Section : 信息论与深度学习

CCF C类 | IJCNN 2019 Special Section : 信息论与深度学习

Call4Papers

5+阅读 · 2018年12月7日

【论文推荐】最新六篇图像描述生成相关论文—字符级推断、视觉解释、语义对齐、实体感知、确定性非自回归

【论文推荐】最新六篇图像描述生成相关论文—字符级推断、视觉解释、语义对齐、实体感知、确定性非自回归

专知

15+阅读 · 2018年5月28日

Capsule Networks解析

Capsule Networks解析

机器学习研究会

11+阅读 · 2017年11月12日

Efficient Conformer: Progressive Downsampling and Grouped Attention for Automatic Speech Recognition

Arxiv

2+阅读 · 2021年8月31日

Speech2Action: Cross-modal Supervision for Action Recognition

Speech2Action: Cross-modal Supervision for Action Recognition

Arxiv

7+阅读 · 2020年3月30日

End-to-End Multi-speaker Speech Recognition with Transformer

Arxiv

8+阅读 · 2020年2月13日

Language Modeling with Deep Transformers

Arxiv

6+阅读 · 2019年7月11日

Exploring RNN-Transducer for Chinese Speech Recognition

Arxiv

4+阅读 · 2019年4月23日

A CNN-RNN Framework with a Novel Patch-Based Multi-Attention Mechanism for Multi-Label Image Classification in Remote Sensing

A CNN-RNN Framework with a Novel Patch-Based Multi-Attention Mechanism for Multi-Label Image Classification in Remote Sensing

Arxiv

3+阅读 · 2019年2月28日

Neural Speech Synthesis with Transformer Network

Neural Speech Synthesis with Transformer Network

Arxiv

5+阅读 · 2019年1月30日

Neural source-filter-based waveform model for statistical parametric speech synthesis

Arxiv

4+阅读 · 2018年11月26日

Syllable-Based Sequence-to-Sequence Speech Recognition with the Transformer in Mandarin Chinese

Arxiv

5+阅读 · 2018年6月4日

State-of-the-art Speech Recognition With Sequence-to-Sequence Models

Arxiv

7+阅读 · 2018年1月18日

VIP会员

文章信息

相关主题

自动语音识别

相关VIP内容

纽约大学最新《语音识别Speech Recognition》2020课程，不可错过！

纽约大学最新《语音识别Speech Recognition》2020课程，不可错过！

专知会员服务

44+阅读 · 2020年11月2日

【ACL2020-亚马逊】Transformers多分辨率和多模态语音识别，Multiresolution and Multimodal Speech Recognition with Transformers

【ACL2020-亚马逊】Transformers多分辨率和多模态语音识别，Multiresolution and Multimodal Speech Recognition with Transformers

专知会员服务

15+阅读 · 2020年5月5日

【哈佛-ICLR2020】基于残差能量模型的文本生成，Residual Energy-Based Models for Text Generation

【哈佛-ICLR2020】基于残差能量模型的文本生成，Residual Energy-Based Models for Text Generation

专知会员服务

11+阅读 · 2020年4月27日

Transformer文本分类代码

Transformer文本分类代码

专知会员服务

118+阅读 · 2020年2月3日

【Yoshua Bengio新论文】多任务自监督学习语音识别，MULTI-TASK SELF-SUPERVISED LEARNING FOR ROBUST SPEECH RECOGNITION

【Yoshua Bengio新论文】多任务自监督学习语音识别，MULTI-TASK SELF-SUPERVISED LEARNING FOR ROBUST SPEECH RECOGNITION

专知会员服务

39+阅读 · 2020年1月30日

【中科院自动化所】序列到序列语音识别的无监督预训练（Unsupervised pre-training for sequence to sequence speech recognition）

【中科院自动化所】序列到序列语音识别的无监督预训练（Unsupervised pre-training for sequence to sequence speech recognition）

专知会员服务

33+阅读 · 2020年1月5日

【论文】使用编码器进行命名实体识别（TENER: Adapting Transformer Encoder for Named Entity Recognition）

【论文】使用编码器进行命名实体识别（TENER: Adapting Transformer Encoder for Named Entity Recognition）

专知会员服务

52+阅读 · 2019年12月28日

【AAAI2020】多模态注意力语义图嵌入多标签分类（Cross-Modality Attention with Semantic Graph Embedding for Multi-Label Classification）

【AAAI2020】多模态注意力语义图嵌入多标签分类（Cross-Modality Attention with Semantic Graph Embedding for Multi-Label Classification）

专知会员服务

92+阅读 · 2019年12月22日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

49+阅读 · 2019年10月17日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

热门VIP内容

开通专知VIP会员享更多权益服务

《乌克兰无人机产业：志愿者与政策在构建新兴无人机产业中的协同作用》最新报告

《人工智能辅助决策中的数据可视化：系统性综述》

人工智能驱动弹药制造现代化：美国陆军转型之路

《敏捷作战部署中枢纽-辐条基地选址优化研究》80页

相关资讯

已删除

将门创投

5+阅读 · 2019年5月5日

CCF C类 | IJCNN 2019 Special Section : 信息论与深度学习

CCF C类 | IJCNN 2019 Special Section : 信息论与深度学习

Call4Papers

5+阅读 · 2018年12月7日

【论文推荐】最新六篇图像描述生成相关论文—字符级推断、视觉解释、语义对齐、实体感知、确定性非自回归

【论文推荐】最新六篇图像描述生成相关论文—字符级推断、视觉解释、语义对齐、实体感知、确定性非自回归

专知

15+阅读 · 2018年5月28日

Capsule Networks解析

Capsule Networks解析

机器学习研究会

11+阅读 · 2017年11月12日

相关论文

Efficient Conformer: Progressive Downsampling and Grouped Attention for Automatic Speech Recognition

Arxiv

2+阅读 · 2021年8月31日

Speech2Action: Cross-modal Supervision for Action Recognition

Speech2Action: Cross-modal Supervision for Action Recognition

Arxiv

7+阅读 · 2020年3月30日

End-to-End Multi-speaker Speech Recognition with Transformer

Arxiv

8+阅读 · 2020年2月13日

Language Modeling with Deep Transformers

Arxiv

6+阅读 · 2019年7月11日

Exploring RNN-Transducer for Chinese Speech Recognition

Arxiv

4+阅读 · 2019年4月23日

A CNN-RNN Framework with a Novel Patch-Based Multi-Attention Mechanism for Multi-Label Image Classification in Remote Sensing

A CNN-RNN Framework with a Novel Patch-Based Multi-Attention Mechanism for Multi-Label Image Classification in Remote Sensing

Arxiv

3+阅读 · 2019年2月28日

Neural Speech Synthesis with Transformer Network

Neural Speech Synthesis with Transformer Network

Arxiv

5+阅读 · 2019年1月30日

Neural source-filter-based waveform model for statistical parametric speech synthesis

Arxiv

4+阅读 · 2018年11月26日

Syllable-Based Sequence-to-Sequence Speech Recognition with the Transformer in Mandarin Chinese

Arxiv

5+阅读 · 2018年6月4日

State-of-the-art Speech Recognition With Sequence-to-Sequence Models

Arxiv

7+阅读 · 2018年1月18日

微信扫码咨询专知VIP会员