Conventional audio-visual models have independent audio and video branches. We design a unified model for audio and video processing called the Unified Audio-Visual Model (UAVM). In this paper, we describe UAVM, report its new state-of-the-art audio-visual event classification accuracy of 65.8% on VGGSound, and discuss the intriguing properties of the model.