Speakers may move around while diarisation is being performed. When a microphone array is used, the instantaneous locations from which sounds originate can be estimated, and previous investigations have shown that this information can be complementary to speaker embeddings in the diarisation task. However, these approaches often assume that speakers remain fairly stationary throughout a meeting. This paper relaxes that assumption by explicitly tracking the movements of speakers while jointly performing diarisation within a unified model. A state-space model is proposed in which the hidden state expresses the identity of the currently active speaker and the predicted locations of all speakers. The model is implemented as a particle filter. Experiments on a Microsoft rich meeting transcription task show that the proposed joint location tracking and diarisation approach performs comparably to other methods that use location information.
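To make the idea of a hidden state that combines the active speaker's identity with the locations of all speakers more concrete, the following is a minimal sketch of a bootstrap particle filter over such a state. It is not the model from the paper: the one-dimensional location (for example a direction-of-arrival angle), the random-walk motion model, the "sticky" speaker-switching probability, the Gaussian observation likelihood, the omission of speaker embeddings, and all names and parameter values (`particle_filter_diarise`, `stay_prob`, `motion_std`, `obs_std`) are assumptions made purely for illustration.

```python
import numpy as np


def particle_filter_diarise(observations, init_locations, n_particles=500,
                            stay_prob=0.95, motion_std=0.02, obs_std=0.1,
                            seed=0):
    """Illustrative joint tracking and diarisation (all models are assumed).

    observations   : 1-D sequence of estimated source locations, one per frame
    init_locations : initial location of each speaker (at least two speakers)
    Returns the per-frame active-speaker decisions and location estimates.
    """
    rng = np.random.default_rng(seed)
    init_locations = np.asarray(init_locations, dtype=float)
    n_speakers = init_locations.shape[0]
    rows = np.arange(n_particles)

    # Hidden state of each particle: the index of the active speaker
    # plus the current locations of all speakers.
    speaker = rng.integers(0, n_speakers, size=n_particles)
    locations = np.tile(init_locations, (n_particles, 1))

    decisions, tracks = [], []
    for z in observations:
        # Predict: the active speaker keeps the floor with probability
        # stay_prob, otherwise the floor passes uniformly to another speaker.
        switch = rng.random(n_particles) >= stay_prob
        offset = rng.integers(1, n_speakers, size=n_particles)
        speaker = np.where(switch, (speaker + offset) % n_speakers, speaker)
        # Predict: every speaker's location follows a random walk.
        locations = locations + rng.normal(0.0, motion_std, locations.shape)

        # Update: weight each particle by how well the location of its
        # hypothesised active speaker explains the observed location.
        predicted = locations[rows, speaker]
        log_w = -0.5 * ((z - predicted) / obs_std) ** 2
        w = np.exp(log_w - log_w.max())
        w /= w.sum()

        # Estimate: posterior over the active speaker and the weighted
        # mean location of every speaker.
        posterior = np.bincount(speaker, weights=w, minlength=n_speakers)
        decisions.append(int(posterior.argmax()))
        tracks.append(w @ locations)

        # Resample (multinomial) to concentrate particles on likely states.
        idx = rng.choice(n_particles, size=n_particles, p=w)
        speaker, locations = speaker[idx], locations[idx]

    return np.array(decisions), np.array(tracks)


# Toy usage: the sound source sits near speaker 0, then near speaker 1.
toy_observations = [0.0] * 20 + [1.0] * 20
toy_decisions, toy_tracks = particle_filter_diarise(toy_observations, [0.0, 1.0])
```

Because each particle carries the locations of all speakers, a speaker who is silent still has a location estimate; it simply diffuses under the assumed random walk until that speaker is next observed. A fuller model would presumably also score speaker embeddings in the update step, which this sketch leaves out.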