Previous work has shown that spatial location information can complement speaker embeddings in speaker diarisation. However, the models used often assume that speakers remain fairly stationary throughout a meeting. This paper proposes relaxing this assumption by explicitly modelling the movements of speakers within an Agglomerative Hierarchical Clustering (AHC) diarisation framework. Kalman filters, which track the locations of speakers, are used to compute log-likelihood ratios that contribute to the cluster affinity computations for the AHC merging and stopping decisions. Experiments show that the proposed approach yields improvements on a Microsoft rich meeting transcription task, compared to methods that do not use location information or that assume stationarity.
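To make the idea of a Kalman-filter log-likelihood ratio for a merge decision concrete, the following is a minimal sketch and not the paper's formulation. It assumes a one-dimensional location (for example a direction-of-arrival angle in degrees), a random-walk motion model, and illustrative noise settings; the names `track_loglik` and `merge_llr` and the parameters `q`, `r`, `m0` and `p0` are hypothetical. The ratio compares the hypothesis that two clusters' location observations come from one moving speaker against the hypothesis that they come from two independently tracked speakers.

```python
import math


def track_loglik(obs, q=1.0, r=4.0, m0=0.0, p0=1e4):
    """Log-likelihood of (time, location) observations under a 1-D
    random-walk Kalman filter. All parameter values are assumptions:
    q  - process noise variance per unit time (how fast a speaker may move)
    r  - observation noise variance
    m0 - prior mean of the location
    p0 - prior variance of the location (broad, i.e. weakly informative)
    """
    x, p = m0, p0
    t_prev = None
    total = 0.0
    for t, z in sorted(obs):
        # Predict: the location is a random walk, so only the variance grows.
        if t_prev is not None:
            p += q * (t - t_prev)
        # Innovation and its variance give the predictive log-likelihood.
        s = p + r
        innov = z - x
        total += -0.5 * (math.log(2.0 * math.pi * s) + innov * innov / s)
        # Update the state estimate with the Kalman gain.
        k = p / s
        x += k * innov
        p *= 1.0 - k
        t_prev = t
    return total


def merge_llr(a, b, **kw):
    """Log-likelihood ratio: one shared track versus two separate tracks."""
    return track_loglik(a + b, **kw) - track_loglik(a, **kw) - track_loglik(b, **kw)


# Hypothetical clusters: lists of (time, angle) pairs.
cluster_a = [(0, 10.0), (1, 11.0), (2, 12.0)]
near = [(3, 13.0), (4, 14.0)]   # continues the slow drift of cluster_a
far = [(3, 90.0), (4, 91.0)]    # abrupt jump, unlikely to be the same speaker

print(merge_llr(cluster_a, near))  # positive: merging is favoured
print(merge_llr(cluster_a, far))   # negative: merging is penalised
```

In an AHC setting, a score of this kind could be combined with a speaker-embedding affinity when choosing which clusters to merge and when to stop; how the two are weighted is left open here.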