Turn-taking in dialog can be modeled as the dynamics of the interlocutors' voice activity. We extend prior work and define the predictive task of Voice Activity Projection, a general, self-supervised objective, as a way to train turn-taking models without the need for labeled data. We highlight a theoretical weakness in prior approaches, arguing for the need to model the dependencies among voice activity events in the projection window. We propose four zero-shot tasks, related to the prediction of upcoming turn-shifts and backchannels, and show that the proposed model outperforms prior work.