关于长式多对话者录音的 " 端对端 " 发言人所属ASR (Hypothesis Stitcher for End-to-End Speaker-attributed ASR on Long-form Multi-talker Recordings) - 专知论文

会员服务 ·

0

E2E · 语音识别 · MoDELS · Performer · 端到端 ·

2021 年 1 月 6 日

Hypothesis Stitcher for End-to-End Speaker-attributed ASR on Long-form Multi-talker Recordings

翻译：关于长式多对话者录音的 " 端对端 " 发言人所属ASR

Xuankai Chang,Naoyuki Kanda,Yashesh Gaur,Xiaofei Wang,Zhong Meng,Takuya Yoshioka

from arxiv, Submitted to ICASSP 2021

An end-to-end (E2E) speaker-attributed automatic speech recognition (SA-ASR) model was proposed recently to jointly perform speaker counting, speech recognition and speaker identification. The model achieved a low speaker-attributed word error rate (SA-WER) for monaural overlapped speech comprising an unknown number of speakers. However, the E2E modeling approach is susceptible to the mismatch between the training and testing conditions. It has yet to be investigated whether the E2E SA-ASR model works well for recordings that are much longer than samples seen during training. In this work, we first apply a known decoding technique that was developed to perform single-speaker ASR for long-form audio to our E2E SA-ASR task. Then, we propose a novel method using a sequence-to-sequence model, called hypothesis stitcher. The model takes multiple hypotheses obtained from short audio segments that are extracted from the original long-form input, and it then outputs a fused single hypothesis. We propose several architectural variations of the hypothesis stitcher model and compare them with the conventional decoding methods. Experiments using LibriSpeech and LibriCSS corpora show that the proposed method significantly improves SA-WER especially for long-form multi-talker recordings.

翻译：最近提议了一个端对端自动语音识别(E2E)语标模式,以联合进行语音计数、语音识别和语音识别。该模式实现了由数目不详的发言者组成的月度重叠发言的低语标致词错误率(SA-WER)。然而,E2E建模方法容易造成培训和测试条件之间的不匹配。尚未调查E2E SA-ASR模式对于比培训期间的样本要长得多的录音是否效果良好。在这项工作中,我们首先采用了一种已知的解码技术,用来对我们的E2E SA-ASR任务进行长式音频单语标ASR(SA-WER)。然后,我们提出了一种新颖的方法,使用顺序到顺序模型,称为假设缝合器。该模型采用了从从最初的长式输入中提取的短音频段获得的多个假设,然后产生了一个结合的单一假设。我们提出了一些假设缝合模型的建筑变换,并将它们与拟议的传统解码系统模型进行对比,特别是用于传统解剖式记录系统的实验室,从而大大改进了SA-CS-LS-LS-S-S-S-S-S-S-S-S-S-S-S-S-S-S-SLAR-S-SLAR-S-S-SLAR-S-S-S-SLAR-S-S-SAR-S-S-S-S-S-S-S-S-SAR-S-S-S-S-S-S-S-SLAR-SAR-SAR-SAR-SAR-SLAR-SAR-SAR-SAR-SAR-S-SAR-SLAR-SAR-SAR-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-SAR-SAR-SAR-SAR-SAR-S-SAR-SAR-SAR-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-

0

相关内容

E2E

纽约大学最新《语音识别Speech Recognition》2020课程，不可错过！

纽约大学最新《语音识别Speech Recognition》2020课程，不可错过！

专知会员服务

44+阅读 · 2020年11月2日

【ACL2020-亚马逊】Transformers多分辨率和多模态语音识别，Multiresolution and Multimodal Speech Recognition with Transformers

【ACL2020-亚马逊】Transformers多分辨率和多模态语音识别，Multiresolution and Multimodal Speech Recognition with Transformers

专知会员服务

15+阅读 · 2020年5月5日

【InterSpeech2020】混合语音识别系统中的词汇扩展技术，Techniques for Vocabulary Expansion in Hybrid Speech Recognition Systems

【InterSpeech2020】混合语音识别系统中的词汇扩展技术，Techniques for Vocabulary Expansion in Hybrid Speech Recognition Systems

专知会员服务

17+阅读 · 2020年3月23日

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

专知会员服务

95+阅读 · 2020年3月12日

【Google Research】Wavesplit:通过说话者聚类实现端到端的语音分离，Wavesplit: End-to-End Speech Separation by Speaker Clustering

【Google Research】Wavesplit:通过说话者聚类实现端到端的语音分离，Wavesplit: End-to-End Speech Separation by Speaker Clustering

专知会员服务

19+阅读 · 2020年2月26日

【北邮-腾讯AI】自监督学习音视觉说话人认证，Self-supervised learning for audio-visual speaker diarization

【北邮-腾讯AI】自监督学习音视觉说话人认证，Self-supervised learning for audio-visual speaker diarization

专知会员服务

26+阅读 · 2020年2月16日

【跨语言BERT模型大集合】Transfer learning is increasingly going multilingual with language-specific BERT models

专知会员服务

54+阅读 · 2020年1月30日

【NeurIPS2019】高性能浅层RNN的类脑目标识别（Brain-Like Object Recognition with High-Performing Shallow Recurrent ANNs）

【NeurIPS2019】高性能浅层RNN的类脑目标识别（Brain-Like Object Recognition with High-Performing Shallow Recurrent ANNs）

专知会员服务

13+阅读 · 2019年12月13日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

49+阅读 · 2019年10月17日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

Github项目推荐 | Emotion-recognition 实时表情识别

Github项目推荐 | Emotion-recognition 实时表情识别

AI科技评论

18+阅读 · 2019年7月8日

CVPR2019| 南开大学、Facebook等13篇CVPR论文及源码推荐（显著性检测/实例分割/人脸识别/视频动作识别等）

CVPR2019| 南开大学、Facebook等13篇CVPR论文及源码推荐（显著性检测/实例分割/人脸识别/视频动作识别等）

极市平台

11+阅读 · 2019年5月23日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

Facebook PyText 在 Github 上开源了

Facebook PyText 在 Github 上开源了

AINLP

7+阅读 · 2018年12月14日

CCF C类 | IJCNN 2019 Special Section : 信息论与深度学习

CCF C类 | IJCNN 2019 Special Section : 信息论与深度学习

Call4Papers

5+阅读 · 2018年12月7日

【跟踪Tracking】15篇论文+代码 | 中秋快乐~

【跟踪Tracking】15篇论文+代码 | 中秋快乐~

专知

18+阅读 · 2018年9月24日

【论文推荐】最新八篇主题模型相关论文—在线光谱学习、PAM变分推断、章节推荐、多芯片系统、文本分析、动态主题模型

【论文推荐】最新八篇主题模型相关论文—在线光谱学习、PAM变分推断、章节推荐、多芯片系统、文本分析、动态主题模型

专知

12+阅读 · 2018年5月6日

Hierarchical Disentangled Representations

Hierarchical Disentangled Representations

CreateAMind

4+阅读 · 2018年4月15日

上百份文字的检测与识别资源，包含数据集、code和paper

上百份文字的检测与识别资源，包含数据集、code和paper

数据挖掘入门与实战

17+阅读 · 2017年12月7日

Transfer Learning based Speech Affect Recognition in Urdu

Arxiv

0+阅读 · 2021年3月5日

End-to-End Multi-speaker Speech Recognition with Transformer

Arxiv

8+阅读 · 2020年2月13日

Meta Learning for End-to-End Low-Resource Speech Recognition

Meta Learning for End-to-End Low-Resource Speech Recognition

Arxiv

20+阅读 · 2019年10月26日

Exploring RNN-Transducer for Chinese Speech Recognition

Arxiv

4+阅读 · 2019年4月23日

Neural Speech Synthesis with Transformer Network

Neural Speech Synthesis with Transformer Network

Arxiv

5+阅读 · 2019年1月30日

End-to-End Video Captioning with Multitask Reinforcement Learning

End-to-End Video Captioning with Multitask Reinforcement Learning

Arxiv

3+阅读 · 2019年1月1日

End-to-end Speech Recognition with Word-based RNN Language Models

End-to-end Speech Recognition with Word-based RNN Language Models

Arxiv

3+阅读 · 2018年8月8日

Unified Hypersphere Embedding for Speaker Recognition

Arxiv

5+阅读 · 2018年7月22日

Joint Recognition of Handwritten Text and Named Entities with a Neural End-to-end Model

Arxiv

6+阅读 · 2018年3月22日

State-of-the-art Speech Recognition With Sequence-to-Sequence Models

Arxiv

7+阅读 · 2018年1月18日

VIP会员

文章信息

相关主题

相关VIP内容

纽约大学最新《语音识别Speech Recognition》2020课程，不可错过！

纽约大学最新《语音识别Speech Recognition》2020课程，不可错过！

专知会员服务

44+阅读 · 2020年11月2日

【ACL2020-亚马逊】Transformers多分辨率和多模态语音识别，Multiresolution and Multimodal Speech Recognition with Transformers

【ACL2020-亚马逊】Transformers多分辨率和多模态语音识别，Multiresolution and Multimodal Speech Recognition with Transformers

专知会员服务

15+阅读 · 2020年5月5日

【InterSpeech2020】混合语音识别系统中的词汇扩展技术，Techniques for Vocabulary Expansion in Hybrid Speech Recognition Systems

【InterSpeech2020】混合语音识别系统中的词汇扩展技术，Techniques for Vocabulary Expansion in Hybrid Speech Recognition Systems

专知会员服务

17+阅读 · 2020年3月23日

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

专知会员服务

95+阅读 · 2020年3月12日

【Google Research】Wavesplit:通过说话者聚类实现端到端的语音分离，Wavesplit: End-to-End Speech Separation by Speaker Clustering

【Google Research】Wavesplit:通过说话者聚类实现端到端的语音分离，Wavesplit: End-to-End Speech Separation by Speaker Clustering

专知会员服务

19+阅读 · 2020年2月26日

【北邮-腾讯AI】自监督学习音视觉说话人认证，Self-supervised learning for audio-visual speaker diarization

【北邮-腾讯AI】自监督学习音视觉说话人认证，Self-supervised learning for audio-visual speaker diarization

专知会员服务

26+阅读 · 2020年2月16日

【跨语言BERT模型大集合】Transfer learning is increasingly going multilingual with language-specific BERT models

专知会员服务

54+阅读 · 2020年1月30日

【NeurIPS2019】高性能浅层RNN的类脑目标识别（Brain-Like Object Recognition with High-Performing Shallow Recurrent ANNs）

【NeurIPS2019】高性能浅层RNN的类脑目标识别（Brain-Like Object Recognition with High-Performing Shallow Recurrent ANNs）

专知会员服务

13+阅读 · 2019年12月13日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

49+阅读 · 2019年10月17日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

热门VIP内容

开通专知VIP会员享更多权益服务

新书册《几何深度学习的数学基础》

中程单向攻击无人机的战略意义：俄乌战争启示

在无标注条件下适配视觉—语言模型：全面综述

面向视觉语言模型的持续学习：遗忘之外的综述与分类体系

相关资讯

Github项目推荐 | Emotion-recognition 实时表情识别

Github项目推荐 | Emotion-recognition 实时表情识别

AI科技评论

18+阅读 · 2019年7月8日

CVPR2019| 南开大学、Facebook等13篇CVPR论文及源码推荐（显著性检测/实例分割/人脸识别/视频动作识别等）

CVPR2019| 南开大学、Facebook等13篇CVPR论文及源码推荐（显著性检测/实例分割/人脸识别/视频动作识别等）

极市平台

11+阅读 · 2019年5月23日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

Facebook PyText 在 Github 上开源了

Facebook PyText 在 Github 上开源了

AINLP

7+阅读 · 2018年12月14日

CCF C类 | IJCNN 2019 Special Section : 信息论与深度学习

CCF C类 | IJCNN 2019 Special Section : 信息论与深度学习

Call4Papers

5+阅读 · 2018年12月7日

【跟踪Tracking】15篇论文+代码 | 中秋快乐~

【跟踪Tracking】15篇论文+代码 | 中秋快乐~

专知

18+阅读 · 2018年9月24日

【论文推荐】最新八篇主题模型相关论文—在线光谱学习、PAM变分推断、章节推荐、多芯片系统、文本分析、动态主题模型

【论文推荐】最新八篇主题模型相关论文—在线光谱学习、PAM变分推断、章节推荐、多芯片系统、文本分析、动态主题模型

专知

12+阅读 · 2018年5月6日

Hierarchical Disentangled Representations

Hierarchical Disentangled Representations

CreateAMind

4+阅读 · 2018年4月15日

上百份文字的检测与识别资源，包含数据集、code和paper

上百份文字的检测与识别资源，包含数据集、code和paper

数据挖掘入门与实战

17+阅读 · 2017年12月7日

相关论文

Transfer Learning based Speech Affect Recognition in Urdu

Arxiv

0+阅读 · 2021年3月5日

End-to-End Multi-speaker Speech Recognition with Transformer

Arxiv

8+阅读 · 2020年2月13日

Meta Learning for End-to-End Low-Resource Speech Recognition

Meta Learning for End-to-End Low-Resource Speech Recognition

Arxiv

20+阅读 · 2019年10月26日

Exploring RNN-Transducer for Chinese Speech Recognition

Arxiv

4+阅读 · 2019年4月23日

Neural Speech Synthesis with Transformer Network

Neural Speech Synthesis with Transformer Network

Arxiv

5+阅读 · 2019年1月30日

End-to-End Video Captioning with Multitask Reinforcement Learning

End-to-End Video Captioning with Multitask Reinforcement Learning

Arxiv

3+阅读 · 2019年1月1日

End-to-end Speech Recognition with Word-based RNN Language Models

End-to-end Speech Recognition with Word-based RNN Language Models

Arxiv

3+阅读 · 2018年8月8日

Unified Hypersphere Embedding for Speaker Recognition

Arxiv

5+阅读 · 2018年7月22日

Joint Recognition of Handwritten Text and Named Entities with a Neural End-to-end Model

Arxiv

6+阅读 · 2018年3月22日

State-of-the-art Speech Recognition With Sequence-to-Sequence Models

Arxiv

7+阅读 · 2018年1月18日

微信扫码咨询专知VIP会员