Speech signals in most real-world scenarios rarely come as well-defined audio segments containing only a single speaker. A typical conversation between two speakers includes segments where the voices overlap, where one speaker interrupts the other, or where a speaker pauses between sentences. Recent advances in diarization technology use neural network-based approaches to improve several subsystems of a speaker diarization system, including the extraction of segment-wise embedding features and the detection of speaker changes during a conversation. However, to identify speakers through clustering, models depend on methods such as Probabilistic Linear Discriminant Analysis (PLDA) to produce a similarity measure between two segments extracted from a given conversational audio recording. Because these methods ignore the temporal structure of conversations, they tend to yield a higher Diarization Error Rate (DER), with errors in both speaker identification and speaker change detection. Therefore, to compare the similarity of two speech segments both independently and sequentially, we propose a Bi-directional Long Short-Term Memory (Bi-LSTM) network for estimating the elements of the similarity matrix. Once the similarity matrix is generated, Agglomerative Hierarchical Clustering (AHC) is applied to identify speaker segments based on thresholding. Performance is evaluated with the DER metric. The proposed model achieves a lower DER of 34.80% on a test set of audio samples derived from the ICSI Meeting Corpus, compared with 39.90% for the traditional PLDA-based similarity measure.
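The clustering step described above, in which AHC is applied to a segment-by-segment similarity matrix and cut at a threshold, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `cluster_segments`, the use of average linkage, the conversion of similarity to distance as `1 - similarity`, the toy matrix, and the threshold value are all assumptions made for illustration, and the similarity scores are assumed to lie in [0, 1].

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def cluster_segments(similarity, threshold):
    """Assign a cluster (speaker) label to each segment.

    `similarity` is an (n, n) matrix of pairwise scores assumed to lie
    in [0, 1]; `threshold` is a distance cut-off (both are assumptions).
    """
    sim = np.asarray(similarity, dtype=float)
    # Symmetrise: a learned scorer need not return S[i, j] == S[j, i].
    sim = (sim + sim.T) / 2.0
    # Turn similarity into a distance and force a zero diagonal.
    dist = 1.0 - sim
    np.fill_diagonal(dist, 0.0)
    # Average-linkage AHC on the condensed (upper-triangle) distances.
    tree = linkage(squareform(dist, checks=False), method="average")
    # Cut the dendrogram: clusters farther apart than `threshold` stay separate.
    return fcluster(tree, t=threshold, criterion="distance")


# Hypothetical scores for four segments: 0 and 1 match, 2 and 3 match.
toy = np.array([
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
])
labels = cluster_segments(toy, threshold=0.5)
```

In this toy case, segments 0 and 1 end up in one cluster and segments 2 and 3 in another. In practice the threshold would be tuned on held-out data, since it directly controls how many speakers are found.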