In a noisy conversation environment such as a dinner party, people often exhibit selective auditory attention, or the ability to focus on a particular speaker while tuning out others. Recognizing who somebody is listening to in a conversation is essential for developing technologies that can understand social behavior and devices that can augment human hearing by amplifying particular sound sources. The computer vision and audio research communities have made great strides towards recognizing sound sources and speakers in scenes. In this work, we take a step further by focusing on the problem of localizing auditory attention targets in egocentric video, or detecting who in a camera wearer's field of view they are listening to. To tackle the new and challenging Selective Auditory Attention Localization problem, we propose an end-to-end deep learning approach that uses egocentric video and multichannel audio to predict the heatmap of the camera wearer's auditory attention. Our approach leverages spatiotemporal audiovisual features and holistic reasoning about the scene to make predictions, and outperforms a set of baselines on a challenging multi-speaker conversation dataset. Project page: https://fkryan.github.io/saal
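To make the input/output structure concrete, below is a minimal PyTorch sketch of an audiovisual model that consumes an egocentric video clip and multichannel audio and predicts a spatial attention heatmap. This is an illustrative assumption, not the paper's architecture: the class name `AudioVisualAttentionNet`, the layer configurations, the late-fusion scheme, and the `n_mics` parameter are all hypothetical.

```python
# Hypothetical sketch of an audiovisual heatmap predictor in the spirit of
# the task described above. NOT the authors' architecture: module names,
# feature sizes, and the fusion scheme are illustrative assumptions only.
import torch
import torch.nn as nn


class AudioVisualAttentionNet(nn.Module):
    """Fuses egocentric video and multichannel audio to predict a
    per-pixel heatmap of the wearer's auditory attention (assumed design)."""

    def __init__(self, n_mics: int = 4, feat: int = 64):
        super().__init__()
        # Video branch: 3D convs over (C, T, H, W) clips -> spatiotemporal features.
        self.video = nn.Sequential(
            nn.Conv3d(3, feat, kernel_size=(3, 7, 7), stride=(1, 2, 2), padding=(1, 3, 3)),
            nn.ReLU(inplace=True),
            nn.Conv3d(feat, feat, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        # Audio branch: 2D convs over per-microphone spectrograms; the channel
        # dimension (n_mics) carries the spatial cues of the multichannel input.
        self.audio = nn.Sequential(
            nn.Conv2d(n_mics, feat, kernel_size=7, stride=2, padding=3),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),  # -> one global audio embedding
        )
        # Decoder: fuse the audio embedding with video features, emit a heatmap.
        self.decoder = nn.Sequential(
            nn.Conv2d(feat * 2, feat, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat, 1, kernel_size=1),
        )

    def forward(self, video: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # video: (B, 3, T, H, W); audio: (B, n_mics, freq, time)
        v = self.video(video).mean(dim=2)             # pool time -> (B, F, H', W')
        a = self.audio(audio)                         # (B, F, 1, 1)
        a = a.expand(-1, -1, v.shape[2], v.shape[3])  # broadcast over space
        logits = self.decoder(torch.cat([v, a], dim=1))
        return torch.sigmoid(logits)                  # heatmap values in [0, 1]


# Smoke test with random tensors standing in for a clip and a mic array.
if __name__ == "__main__":
    net = AudioVisualAttentionNet()
    heatmap = net(torch.randn(2, 3, 8, 128, 128), torch.randn(2, 4, 64, 100))
    print(heatmap.shape)  # torch.Size([2, 1, 32, 32])
```

Broadcasting a single global audio embedding over the spatial grid is the simplest possible fusion choice for this sketch; a model that actually localizes sound sources would more plausibly preserve spatial structure in the audio features as well.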