An end-to-end (E2E) speaker-attributed automatic speech recognition (SA-ASR) model was recently proposed to jointly perform speaker counting, speech recognition, and speaker identification. The model achieved a low speaker-attributed word error rate (SA-WER) for monaural overlapped speech comprising an unknown number of speakers. However, the E2E modeling approach is susceptible to mismatch between training and testing conditions, and it has yet to be investigated whether the E2E SA-ASR model works well for recordings that are much longer than the samples seen during training. In this work, we first apply a known decoding technique, developed for single-speaker ASR on long-form audio, to our E2E SA-ASR task. We then propose a novel method using a sequence-to-sequence model, called the hypothesis stitcher. The model takes multiple hypotheses obtained from short audio segments extracted from the original long-form input and outputs a single fused hypothesis. We propose several architectural variations of the hypothesis stitcher model and compare them with conventional decoding methods. Experiments on the LibriSpeech and LibriCSS corpora show that the proposed method significantly improves SA-WER, especially for long-form multi-talker recordings.
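The overall pipeline described above — segmenting long-form audio, decoding each segment independently, then feeding the per-segment hypotheses to a stitcher model — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the `<sep>` separator token, and the windowing parameters are all assumptions, and the actual stitcher is a trained sequence-to-sequence model rather than simple string joining.

```python
# Illustrative sketch of the long-form decoding pipeline:
# (1) split a long recording into short overlapping windows,
# (2) decode each window with the E2E SA-ASR model (stubbed here),
# (3) serialize the per-segment hypotheses into one input sequence
#     that a hypothesis-stitcher model would fuse into a single output.

def segment_audio(num_frames, window, hop):
    """Split a long recording into fixed-length overlapping windows,
    returned as (start, end) frame-index pairs."""
    starts = range(0, max(num_frames - window, 0) + 1, hop)
    return [(s, min(s + window, num_frames)) for s in starts]

def serialize_hypotheses(segment_hyps, sep="<sep>"):
    """Join per-segment hypotheses with a separator token, forming the
    input a stitcher model might consume before emitting one fused
    hypothesis. (The separator token is a hypothetical choice.)"""
    return f" {sep} ".join(segment_hyps)

# Example: three short-segment hypotheses whose overlapping words a
# trained stitcher would learn to de-duplicate and merge.
segments = segment_audio(num_frames=100, window=40, hop=30)
print(segments)  # [(0, 40), (30, 70), (60, 100)]

hyps = ["hello how are", "how are you today", "you today goodbye"]
print(serialize_hypotheses(hyps))
# hello how are <sep> how are you today <sep> you today goodbye
```

Because consecutive windows overlap, neighboring hypotheses share words at their boundaries; the stitcher's job is to resolve these overlaps (and speaker labels) into one coherent transcript, rather than naively concatenating the segments.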

