This report describes our approach to the Audio-Visual Diarization (AVD) task of the Ego4D Challenge 2022. We present several technical improvements over the official baselines. First, we improve detection of the camera wearer's voice activity by modifying the training scheme of the corresponding model. Second, we find that an off-the-shelf voice activity detection model effectively removes false positives when it is applied solely to the camera wearer's voice activities. Lastly, we show that better active speaker detection leads to a better AVD outcome. Our final method obtains a diarization error rate (DER) of 65.9% on the Ego4D test set, significantly outperforming all the baselines. Our submission achieved 1st place in the Ego4D Challenge 2022.
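The second improvement, filtering the camera wearer's voice activity with an off-the-shelf voice activity detector, can be illustrated with a small sketch. The abstract does not say how the two outputs are combined, so the code below assumes one plausible scheme: keep only the parts of the predicted wearer segments that overlap speech regions reported by the VAD. The function name `intersect_segments`, the `(start, end)` segment representation, and the sample values are hypothetical and not taken from the report.

```python
from typing import List, Tuple

# A segment is (start_sec, end_sec). This representation is an assumption.
Segment = Tuple[float, float]


def intersect_segments(wearer: List[Segment], vad: List[Segment]) -> List[Segment]:
    """Keep only the portions of wearer segments that overlap VAD speech.

    Assumes the segments within each list do not overlap one another.
    """
    wearer = sorted(wearer)
    vad = sorted(vad)
    kept: List[Segment] = []
    i = j = 0
    while i < len(wearer) and j < len(vad):
        start = max(wearer[i][0], vad[j][0])
        end = min(wearer[i][1], vad[j][1])
        if start < end:
            # Overlap found: this part of the wearer prediction is confirmed.
            kept.append((start, end))
        # Advance whichever segment ends first.
        if wearer[i][1] < vad[j][1]:
            i += 1
        else:
            j += 1
    return kept


# Hypothetical predictions, in seconds.
wearer_pred = [(0.0, 2.0), (5.0, 7.0)]
vad_speech = [(1.0, 1.5), (1.8, 5.5), (9.0, 10.0)]
filtered = intersect_segments(wearer_pred, vad_speech)
print(filtered)
```

Under this assumption, any wearer segment with no VAD support is dropped entirely, which is one way a general-purpose VAD could suppress false positives without affecting the other speakers' tracks.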