Speaker embedding extractors significantly influence the performance of clustering-based speaker diarisation systems. Conventionally, only one embedding is extracted from each speech segment. However, because segments are produced with a sliding window, a single segment can easily contain two or more speakers whenever it spans a speaker change point. This study proposes a novel embedding extractor architecture, referred to as a high-resolution embedding extractor (HEE), which extracts multiple high-resolution embeddings from each speech segment. HEE consists of a feature-map extractor and an enhancer, where the enhancer, built on a self-attention mechanism, is the key to success. The enhancer replaces the aggregation process: instead of applying a global pooling layer, it gathers the relevant information into each frame via attention that leverages the global context. Each of the resulting dense frame-level embeddings can represent a speaker, so multiple speakers within a segment can be represented by different frame-level features. We also propose a training framework that artificially generates mixture data to train the proposed HEE. In experiments on five evaluation sets, including four public datasets, the proposed HEE demonstrates at least a 10% improvement on each evaluation set except one, where our analysis suggests that rapid speaker changes are less frequent.
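The contrast between global pooling and an attention-based enhancer can be illustrated with a toy sketch. This is not the HEE architecture itself: it assumes an unparameterised scaled dot-product self-attention (queries, keys and values are all the raw frames), a hand-made two-dimensional "feature map", and hypothetical helper names such as `global_average_pool` and `self_attention`. The real model would use learned layers on top of a trained feature-map extractor.

```python
import math


def global_average_pool(frames):
    """Collapse a T x D feature map into one D-dimensional embedding."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(frames):
    """Toy scaled dot-product self-attention with Q = K = V = frames.

    Returns one context-enhanced vector per frame (T x D), so the
    time resolution is preserved instead of being pooled away.
    """
    dim = len(frames[0])
    scale = math.sqrt(dim)
    enhanced = []
    for query in frames:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in frames]
        weights = softmax(scores)
        enhanced.append(
            [sum(w * value[d] for w, value in zip(weights, frames)) for d in range(dim)]
        )
    return enhanced


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Assumed toy segment: three frames of speaker A followed by three of speaker B,
# i.e. a window that straddles a speaker change point.
speaker_a = [4.0, 0.0]
speaker_b = [0.0, 4.0]
segment = [speaker_a] * 3 + [speaker_b] * 3

pooled = global_average_pool(segment)   # one embedding for the whole segment
enhanced = self_attention(segment)      # one embedding per frame

print("pooled vs A / B:", cosine(pooled, speaker_a), cosine(pooled, speaker_b))
print("first frame vs A:", cosine(enhanced[0], speaker_a))
print("last frame vs B:", cosine(enhanced[-1], speaker_b))
```

In this toy case the pooled embedding is equally similar to both speakers, whereas each attention-enhanced frame stays close to the speaker active in that frame, which is the property the abstract relies on when it says multiple speakers can be represented within one segment.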