In challenging real-life conditions such as extreme head pose, occlusion, and low-resolution imagery, where visual information alone fails to yield a reliable estimate of visual attention or gaze direction, audio signals can provide important complementary information. In this paper, we explore whether audio-guided coarse head pose can further improve visual attention estimation for non-prolific faces. Because it is difficult to annotate audio signals with the speaker's head pose, we use off-the-shelf state-of-the-art models to provide cross-modal weak supervision. During training, the framework learns complementary information from synchronized audio-visual modalities. At inference, our model can use whichever modalities are available, i.e., audio, visual, or audio-visual, for task-specific inference. When AV-Gaze is tested with each of these modalities, it achieves competitive results on multiple benchmark datasets while adapting well to challenging scenarios.