Speaker diarization is the task of labeling an audio or video recording with the identity of the speaker at each time stamp. In this work, we propose a novel machine learning framework that performs real-time multi-speaker diarization and recognition without prior registration or pretraining, in a fully online reinforcement learning setting. Our framework casts embedding extraction, clustering, and resegmentation as a single online decision-making problem. We discuss practical considerations and advanced techniques, such as offline reinforcement learning, semi-supervision, and domain adaptation, to address the challenges of limited training data and out-of-distribution environments. Our approach treats speaker diarization as a fully online learning problem over the speaker recognition task: the agent receives no pretraining on any training set before deployment and learns to detect speaker identity on the fly through reward feedback. This reinforcement learning paradigm yields an adaptive, lightweight, and generalizable system that is useful for multi-user teleconferences, where many people may come and go without extensive pre-registration. Lastly, we provide a desktop application that uses our proposed approach as a proof of concept. To the best of our knowledge, this is the first work to apply reinforcement learning to the speaker diarization task.
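To make the online setting concrete, the following is a minimal sketch of an agent that starts with no pretraining and learns speaker labels purely from reward feedback. It is not the framework proposed in this work: it substitutes a plain epsilon-greedy linear contextual bandit, assumes that a fixed-size embedding is already available for each frame, and assumes a reward of 1 for a correct label and 0 otherwise. The names `OnlineSpeakerBandit` and `run_stream` are hypothetical and introduced only for illustration.

```python
import random


class OnlineSpeakerBandit:
    """Epsilon-greedy linear contextual bandit with one arm per candidate speaker.

    Illustrative stand-in only; not the method proposed in the paper.
    """

    def __init__(self, n_speakers, dim, epsilon=0.1, lr=0.5, seed=0):
        self.epsilon = epsilon
        self.lr = lr
        self.rng = random.Random(seed)
        # Optimistic initial weights: every speaker label looks promising
        # until reward feedback says otherwise.
        self.weights = [[1.0] * dim for _ in range(n_speakers)]

    def _score(self, arm, x):
        # Predicted reward of assigning speaker `arm` to embedding `x`.
        return sum(w * v for w, v in zip(self.weights[arm], x))

    def predict(self, x, explore=True):
        # With probability epsilon, try a random speaker label.
        if explore and self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.weights))
        scores = [self._score(a, x) for a in range(len(self.weights))]
        return scores.index(max(scores))

    def update(self, x, arm, reward):
        # Move the chosen arm's predicted reward toward the observed reward.
        error = reward - self._score(arm, x)
        self.weights[arm] = [
            w + self.lr * error * v for w, v in zip(self.weights[arm], x)
        ]


def run_stream(agent, stream):
    """Label each (embedding, true_speaker) frame online; return the correct count."""
    correct = 0
    for x, true_speaker in stream:
        guess = agent.predict(x)
        # Assumed feedback signal: 1 if the label was right, 0 otherwise.
        reward = 1.0 if guess == true_speaker else 0.0
        agent.update(x, guess, reward)
        correct += int(reward)
    return correct


if __name__ == "__main__":
    # Toy stream: two speakers with orthogonal two-dimensional "embeddings".
    toy_stream = [([1.0, 0.0], 0), ([0.0, 1.0], 1)] * 50
    agent = OnlineSpeakerBandit(n_speakers=2, dim=2)
    print(run_stream(agent, toy_stream), "of", len(toy_stream), "frames labeled correctly")
```

In this sketch the true label is used only to compute the reward, so the agent never sees a labeled training set. In a deployed system the reward would have to come from some other feedback channel, which this example does not model.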