In this paper, we present AISHELL-4, a sizable real-recorded Mandarin speech dataset collected with an 8-channel circular microphone array for speech processing in conference scenarios. The dataset consists of 211 recorded meeting sessions, each containing 4 to 8 speakers, with a total length of 120 hours. This dataset aims to bridge advanced research on multi-speaker processing and practical application scenarios in three aspects. First, with real recorded meetings, AISHELL-4 provides realistic acoustics and rich natural speech characteristics of conversation, such as short pauses, speech overlap, quick speaker turns, and noise. Second, accurate transcriptions and speaker voice activity are provided for each meeting in AISHELL-4. This allows researchers to explore different aspects of meeting processing, ranging from individual tasks such as speech front-end processing, speech recognition, and speaker diarization, to multi-modality modeling and joint optimization of relevant tasks. Third, given that most open-source datasets for multi-speaker tasks are in English, AISHELL-4 is the only Mandarin dataset for conversational speech, providing additional value for data diversity in the speech community. We also release a PyTorch-based training and evaluation framework as a baseline system to promote reproducible research in this field.