Active speaker detection plays a vital role in human-machine interaction. Several end-to-end audiovisual frameworks have emerged recently. However, the inference time of these models has not been explored, and their complexity and large input size make them unsuitable for real-time applications. In addition, they share a similar feature extraction strategy that applies a ConvNet to both the audio and visual inputs. This work presents a novel two-stream end-to-end framework that fuses features extracted from images via VGG-M with raw Mel-frequency cepstral coefficient (MFCC) features extracted from the audio waveform. The network attaches two BiGRU layers to each stream to model that stream's temporal dynamics before fusion. After fusion, one BiGRU layer models the joint temporal dynamics. Experimental results on the AVA-ActiveSpeaker dataset indicate that the new feature extraction strategy is more robust to noisy signals and has a better inference time than models that apply a ConvNet to both modalities. The proposed model predicts within 44.41 ms, which is fast enough for real-time applications. Our best-performing model attained 88.929% accuracy, nearly the same detection result as state-of-the-art work.
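The data flow described in the abstract (per-stream BiGRU layers, fusion, then one joint BiGRU layer) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the weights are random and untrained, the visual input stands in for precomputed VGG-M features, the audio input stands in for MFCC vectors, both streams are assumed to be time-aligned with the same number of frames, fusion is assumed to be concatenation, and the per-frame sigmoid output is assumed. All class names and dimensions (`GRUCell`, `BiGRU`, `TwoStreamSketch`, `hid_dim=8`, 16 visual and 13 audio features) are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


class GRUCell:
    """GRU cell with random, untrained weights (biases omitted for brevity)."""

    def __init__(self, in_dim, hid_dim, rng):
        def mat(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        # One input and one recurrent matrix per gate:
        # update (z), reset (r), candidate (n).
        self.W = {g: mat(hid_dim, in_dim) for g in "zrn"}
        self.U = {g: mat(hid_dim, hid_dim) for g in "zrn"}
        self.hid_dim = hid_dim

    def step(self, x, h):
        z = [sigmoid(a + b) for a, b in zip(matvec(self.W["z"], x), matvec(self.U["z"], h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.W["r"], x), matvec(self.U["r"], h))]
        rh = [ri * hi for ri, hi in zip(r, h)]
        n = [math.tanh(a + b) for a, b in zip(matvec(self.W["n"], x), matvec(self.U["n"], rh))]
        # Blend the previous state with the candidate state.
        return [(1 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]


class BiGRU:
    """Runs one GRU forward and one backward, concatenating their states per frame."""

    def __init__(self, in_dim, hid_dim, rng):
        self.fwd = GRUCell(in_dim, hid_dim, rng)
        self.bwd = GRUCell(in_dim, hid_dim, rng)

    @staticmethod
    def _run(cell, frames):
        h, out = [0.0] * cell.hid_dim, []
        for x in frames:
            h = cell.step(x, h)
            out.append(h)
        return out

    def __call__(self, seq):
        forward = self._run(self.fwd, seq)
        backward = self._run(self.bwd, seq[::-1])[::-1]  # re-align to original order
        return [f + b for f, b in zip(forward, backward)]  # list concatenation


class TwoStreamSketch:
    """Two BiGRU layers per stream, concatenation fusion, one joint BiGRU layer."""

    def __init__(self, visual_dim, audio_dim, hid_dim=8, seed=0):
        rng = random.Random(seed)
        self.visual = [BiGRU(visual_dim, hid_dim, rng), BiGRU(2 * hid_dim, hid_dim, rng)]
        self.audio = [BiGRU(audio_dim, hid_dim, rng), BiGRU(2 * hid_dim, hid_dim, rng)]
        self.joint = BiGRU(4 * hid_dim, hid_dim, rng)
        self.w_out = [rng.uniform(-0.1, 0.1) for _ in range(2 * hid_dim)]

    def __call__(self, visual_seq, audio_seq):
        for layer in self.visual:
            visual_seq = layer(visual_seq)
        for layer in self.audio:
            audio_seq = layer(audio_seq)
        # Assumed fusion: concatenate the two streams frame by frame.
        fused = [v + a for v, a in zip(visual_seq, audio_seq)]
        joint = self.joint(fused)
        # Assumed output head: one speaking score in (0, 1) per frame.
        return [sigmoid(sum(w * x for w, x in zip(self.w_out, h))) for h in joint]


# Toy inputs: 5 time-aligned frames of stand-in visual and audio features.
data_rng = random.Random(1)
visual = [[data_rng.random() for _ in range(16)] for _ in range(5)]
audio = [[data_rng.random() for _ in range(13)] for _ in range(5)]
model = TwoStreamSketch(visual_dim=16, audio_dim=13)
scores = model(visual, audio)
```

Because the sketch is untrained, the scores carry no meaning; it only shows how each stream is modelled separately before the fused sequence passes through the joint layer.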