The Conversational Short-phrase Speaker Diarization (CSSD) Task: Dataset, Evaluation Metric and Baselines

The conversation scenario is one of the most important and most challenging scenarios for speech processing technologies, because people in conversation respond to each other in a casual style. Detecting the speech activity of each person in a conversation is vital to downstream tasks such as natural language processing and machine translation. The technology for detecting "who spoke when" is known as speaker diarization (SD). Diarization error rate (DER) has long been the standard evaluation metric for SD systems. However, because DER is weighted by duration, it gives too little importance to short conversational phrases, which are brief but important at the semantic level. Moreover, the speech community still lacks a carefully and accurately hand-annotated test dataset suitable for evaluating conversational SD technologies. In this paper, we design and describe the Conversational Short-phrase Speaker Diarization (CSSD) task, which consists of training and testing datasets, an evaluation metric and baselines. For the dataset, in addition to the previously open-sourced 180-hour conversational MagicData-RAMC dataset, we prepare a separate 20-hour conversational speech test dataset with carefully and manually verified speaker timestamp annotations for the CSSD task. For the metric, we design the new conversational DER (CDER) evaluation metric, which calculates SD accuracy at the utterance level. For the baseline, we adopt a commonly used method, the Variational Bayes HMM x-vector system, as the baseline of the CSSD task. Our evaluation metric is publicly available at https://github.com/SpeechClub/CDER_Metric.
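To illustrate why a duration-weighted metric can overlook short phrases, the following is a minimal Python sketch that contrasts a simplified duration-weighted error with a simplified utterance-level error. It is not the official CDER implementation from the repository above. The function names, the `min_correct` threshold, and the toy segments are all hypothetical. The sketch also assumes that hypothesis speaker labels are already mapped to reference labels, that hypothesis segments of the same speaker do not overlap each other, and that false alarms outside reference speech are ignored.

```python
from typing import List, Tuple

# A segment is (start_sec, end_sec, speaker_label).
Segment = Tuple[float, float, str]


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def correct_time(ref_seg: Segment, hyp: List[Segment]) -> float:
    """Seconds of ref_seg covered by hypothesis segments with the same speaker."""
    start, end, speaker = ref_seg
    return sum(
        overlap(start, end, h_start, h_end)
        for h_start, h_end, h_speaker in hyp
        if h_speaker == speaker
    )


def duration_weighted_error(ref: List[Segment], hyp: List[Segment]) -> float:
    """Simplified DER-style score: wrongly labeled reference time / total reference time."""
    total = sum(end - start for start, end, _ in ref)
    wrong = sum((seg[1] - seg[0]) - correct_time(seg, hyp) for seg in ref)
    return wrong / total


def utterance_level_error(
    ref: List[Segment], hyp: List[Segment], min_correct: float = 0.5
) -> float:
    """Simplified utterance-level score: every reference utterance counts once.

    An utterance is wrong when less than `min_correct` of its duration is
    labeled with the right speaker (hypothetical threshold).
    """
    wrong = sum(
        1 for seg in ref if correct_time(seg, hyp) < min_correct * (seg[1] - seg[0])
    )
    return wrong / len(ref)


# Toy conversation: speaker B interjects a 0.5 s phrase between two long turns of A.
ref = [(0.0, 10.0, "A"), (10.0, 10.5, "B"), (10.5, 20.0, "A")]
# Hypothetical system output that misses the short phrase entirely.
hyp = [(0.0, 20.0, "A")]

print(f"duration-weighted error: {duration_weighted_error(ref, hyp):.3f}")
print(f"utterance-level error:   {utterance_level_error(ref, hyp):.3f}")
```

In this toy case the missed phrase accounts for 0.5 s out of 20 s of reference speech, so the duration-weighted score is 0.025, whereas one of three utterances is wrong at the utterance level, giving about 0.333. The gap between the two scores is the behavior that motivates an utterance-level metric.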
