This paper summarizes the JHU team's efforts in tracks 1 and 2 of the CHiME-6 challenge for distant multi-microphone conversational speech diarization and recognition in everyday home environments. We explore multi-array processing techniques at each stage of the pipeline, such as multi-array guided source separation (GSS) for enhancement and acoustic model training data, posterior fusion for speech activity detection, PLDA score fusion for diarization, and lattice combination for automatic speech recognition (ASR). We also report results with different acoustic model architectures, and integrate other techniques such as online multi-channel weighted prediction error (WPE) dereverberation and variational Bayes-hidden Markov model (VB-HMM) based overlap assignment to deal with reverberation and overlapping speakers, respectively. As a result of these efforts, our ASR systems achieve word error rates of 40.5% and 67.5% on tracks 1 and 2, respectively, on the evaluation set. These are absolute improvements of 10.8% and 10.4% over the challenge baselines for the respective tracks.