Real-world applications and settings often involve interaction between different modalities (e.g., video, speech, and text). To process such multimodal information automatically and use it in end applications, Multimodal Representation Learning (MRL) has emerged as an active area of research. MRL involves learning reliable and robust representations of information from heterogeneous sources and fusing them. In practice, however, the data acquired from different sources are typically noisy; in extreme cases, noise of large magnitude can completely alter the semantics of the data, leading to inconsistencies in the parallel multimodal data. In this paper, we propose a novel method for multimodal representation learning in a noisy environment via the generalized product of experts technique. In the proposed method, we train a separate network for each modality to assess the credibility of the information coming from that modality, and the contribution of each modality is then dynamically weighted while estimating the joint distribution. We evaluate our method on two challenging benchmarks from two diverse domains: multimodal 3D hand-pose estimation and multimodal surgical video segmentation, and attain state-of-the-art performance on both. Our extensive quantitative and qualitative evaluations show the advantages of our method over previous approaches.
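To make the fusion step concrete, the following is a minimal sketch of a generalized product of experts (gPoE) combining Gaussian experts, one per modality. The function name and the fixed credibility weights here are illustrative assumptions: in the proposed method the weights would come from the learned per-modality credibility networks, and the experts would be the modality-specific distribution estimates.

```python
import numpy as np

def gpoe_fuse(means, variances, weights):
    """Fuse Gaussian experts via a generalized product of experts.

    Each expert contributes N(mu_i, var_i) raised to a credibility
    weight alpha_i. The joint precision is the weighted sum of expert
    precisions, and the joint mean is the precision-weighted average
    of the expert means. (Illustrative sketch: names and fixed weights
    stand in for the paper's learned credibility networks.)
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    weights = np.asarray(weights, dtype=float)

    # Weighted precision of each expert: alpha_i / var_i.
    weighted_precisions = weights / variances
    joint_var = 1.0 / weighted_precisions.sum(axis=0)
    joint_mean = joint_var * (weighted_precisions * means).sum(axis=0)
    return joint_mean, joint_var

# Two modalities: a trusted one (weight 0.8) and a noisy one (weight 0.2).
# The joint estimate is pulled mostly toward the trusted modality.
mu, var = gpoe_fuse(means=[0.0, 2.0], variances=[1.0, 1.0],
                    weights=[0.8, 0.2])
```

With equal variances, the joint mean reduces to the weight-weighted average of the expert means (here 0.4), so down-weighting a corrupted modality limits how much it can shift the joint distribution.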