In this paper, we propose a pipeline that finds the number of speakers in a source of audio data, and the audio clips belonging to each identified speaker, when neither the number of speakers nor the speaker labels are known a priori. We use this approach as part of our data preparation pipeline for speech recognition in Indic languages (https://github.com/Open-Speech-EkStep/vakyansh-wav2vec2-experimentation). To evaluate the accuracy of the proposed pipeline, we introduce two metrics: Cluster Purity and Cluster Uniqueness. Cluster Purity quantifies how "pure" a cluster is. Cluster Uniqueness quantifies what percentage of clusters belong to only a single dominant speaker. We discuss these metrics further in Section \ref{sec:metrics}. We developed this utility to help identify data by speaker ID before training an Automatic Speech Recognition (ASR) model, and most of this data takes considerable effort to scrape, so it matters how much of it is retained after clustering. On the chosen test set, we find that 98\% of the data is mapped to the top 80\% of clusters, where the top clusters are obtained by removing every cluster with fewer than a fixed number of utterances (we set this threshold to 30 to discard very small clusters).
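The small-cluster filter and the resulting coverage figures can be illustrated with a minimal Python sketch. It is not the implementation from the repository above. The function name `cluster_stats`, its arguments, and the returned keys are hypothetical. The purity computed here is the common "fraction of utterances from the dominant speaker" definition, which is an assumption; the paper's own definitions of Cluster Purity and Cluster Uniqueness are given in Section \ref{sec:metrics}, and Cluster Uniqueness is left out of the sketch for that reason.

```python
from collections import Counter, defaultdict


def cluster_stats(cluster_ids, speaker_ids, min_utterances=30):
    """Filter small clusters and report coverage and per-cluster purity.

    cluster_ids[i] is the cluster assigned to utterance i and
    speaker_ids[i] is its ground-truth speaker (test set only).
    All names here are illustrative, not taken from the paper's code.
    """
    if len(cluster_ids) != len(speaker_ids):
        raise ValueError("cluster_ids and speaker_ids must have equal length")
    if not cluster_ids:
        raise ValueError("at least one utterance is required")

    # Count how many utterances of each speaker fall into each cluster.
    by_cluster = defaultdict(Counter)
    for cluster, speaker in zip(cluster_ids, speaker_ids):
        by_cluster[cluster][speaker] += 1

    # Drop clusters with fewer than `min_utterances` utterances.
    kept = {
        cluster: counts
        for cluster, counts in by_cluster.items()
        if sum(counts.values()) >= min_utterances
    }

    kept_utterances = sum(sum(counts.values()) for counts in kept.values())

    # Assumed purity definition: share of the dominant speaker in a cluster.
    purity = {
        cluster: max(counts.values()) / sum(counts.values())
        for cluster, counts in kept.items()
    }

    return {
        "cluster_fraction": len(kept) / len(by_cluster),
        "data_fraction": kept_utterances / len(cluster_ids),
        "purity": purity,
    }


if __name__ == "__main__":
    # Toy data: two large clusters and one very small one.
    clusters = ["a"] * 40 + ["b"] * 30 + ["c"] * 5
    speakers = ["s1"] * 36 + ["s2"] * 4 + ["s3"] * 30 + ["s1"] * 5
    print(cluster_stats(clusters, speakers))
```

In this sketch, `data_fraction` plays the role of the 98\% figure and `cluster_fraction` the role of the 80\% figure, under the assumption that both are measured after applying the 30-utterance threshold.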