With the rapid growth of social media platforms, users are sharing billions of multimedia posts containing audio, images, and text. Researchers have focused on building autonomous systems capable of processing such multimedia data to solve challenging multimodal tasks, including cross-modal retrieval, matching, and verification. Existing works use separate networks to extract embeddings for each modality in order to bridge the gap between them. The modular structure of these branched networks has been fundamental to numerous multimodal applications and has become the de facto standard for handling multiple modalities. In contrast, we propose a novel single-branch network capable of learning discriminative representations for both unimodal and multimodal tasks without any change to the network. An important feature of our single-branch network is that it can be trained with either a single modality or multiple modalities without sacrificing performance. We evaluate the proposed single-branch network on a challenging multimodal problem, face-voice association, for cross-modal verification and matching tasks with various loss formulations. Experimental results demonstrate the superiority of the proposed single-branch network over existing methods across a wide range of experiments. Code: https://github.com/msaadsaeed/SBNet
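To make the single-branch idea concrete, the following is a minimal sketch, not the official SBNet implementation: it assumes PyTorch, pre-extracted face and voice features of a common dimensionality, and illustrative names such as SingleBranch and embed_dim that are not taken from the released code. The same shared weights embed features from either modality, so no modality-specific sub-branch is required.

```python
# Minimal sketch of a shared (single-branch) embedding network.
# Assumptions: PyTorch; face/voice features already extracted by frozen
# encoders and projected to a common input dimensionality (here 512).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleBranch(nn.Module):
    """One shared projection branch used for both modalities."""
    def __init__(self, in_dim: int = 512, embed_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Identical weights process face features, voice features, or both.
        return F.normalize(self.net(x), dim=-1)  # unit-norm embeddings

if __name__ == "__main__":
    branch = SingleBranch()
    face_feats = torch.randn(4, 512)   # e.g. outputs of a pretrained face encoder
    voice_feats = torch.randn(4, 512)  # e.g. outputs of a pretrained voice encoder
    face_emb, voice_emb = branch(face_feats), branch(voice_feats)
    # Cross-modal verification: cosine similarity between paired embeddings.
    sim = (face_emb * voice_emb).sum(dim=-1)
    print(sim.shape)  # torch.Size([4])
```

Because a single set of weights serves both modalities, the same network can be trained with unimodal or multimodal batches and paired with different loss formulations (e.g., contrastive or verification-style objectives) without architectural changes.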