Turn-taking plays an essential role in regulating the structure of a conversation. Identifying the main speaker (the one properly taking their turn) and the interrupters (those interrupting or reacting to the main speaker's utterances) remains a challenging task. Although prior methods have partially addressed this task, several limitations remain. First, directly associating audio and visual features may limit the correlations that can be extracted, because the two modalities differ in nature. Second, the relationships across temporal segments, which help maintain the consistency of localization, separation, and conversational context, are not effectively exploited. Third, the interactions between speakers, which often carry tracking and anticipatory cues about the transition to a new speaker, are usually ignored. This work therefore introduces a new Audio-Visual Transformer approach that localizes and highlights the main speaker in both the audio and visual channels of in-the-wild multi-speaker conversation videos. The proposed method exploits different types of correlations present in both visual and audio signals. Audio-visual relationships across the spatio-temporal space are anticipated and optimized via the self-attention mechanism of a Transformer structure. Moreover, a newly collected dataset is introduced for main-speaker detection. To the best of our knowledge, this is one of the first studies able to automatically localize and highlight the main speaker in both the visual and audio channels of multi-speaker conversation videos.
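The cross-modal self-attention fusion the abstract describes can be pictured with a minimal sketch. The snippet below is not the paper's implementation: it is a generic cross-modal attention block, assuming PyTorch and illustrative dimensions, in which audio-segment queries attend to visual (face-track) keys and values, the kind of audio-visual correlation the method is said to optimize.

```python
# Minimal sketch (illustrative, not the authors' architecture): one
# cross-modal Transformer attention block. Audio tokens query visual tokens,
# letting the model learn correlations across the two modalities.
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        # Standard multi-head attention; queries come from one modality,
        # keys/values from the other. dim and heads are arbitrary here.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio:  (batch, T_a, dim) audio-segment embeddings
        # visual: (batch, T_v, dim) visual (face-track) embeddings
        fused, _ = self.attn(query=audio, key=visual, value=visual)
        # Residual connection plus layer norm, as in a Transformer layer.
        return self.norm(audio + fused)

# Toy usage: 4 audio segments attending over 16 visual tokens.
audio = torch.randn(1, 4, 256)
visual = torch.randn(1, 16, 256)
print(CrossModalAttention()(audio, visual).shape)  # torch.Size([1, 4, 256])
```

In a full model, blocks like this would be stacked and applied across temporal segments so that localization and separation decisions stay consistent over time, per the abstract's description.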