In this paper, we present AISHELL-4, a sizable real-recorded Mandarin speech dataset collected with an 8-channel circular microphone array for speech processing in conference scenarios. The dataset consists of 211 recorded meeting sessions, each containing 4 to 8 speakers, with a total length of 118 hours. The dataset aims to bridge advanced research on multi-speaker processing and practical application scenarios in three aspects. With real recorded meetings, AISHELL-4 provides realistic acoustics and rich natural speech characteristics of conversation, such as short pauses, speech overlap, quick speaker turns, and noise. Meanwhile, accurate transcriptions and speaker voice activity are provided for each meeting in AISHELL-4. This allows researchers to explore different aspects of meeting processing, ranging from individual tasks such as speech front-end processing, speech recognition, and speaker diarization to multi-modality modeling and joint optimization of related tasks. Given that most open-source datasets for multi-speaker tasks are in English, AISHELL-4 is the only Mandarin dataset for conversational speech, providing additional value for data diversity in the speech community.