Multimodal Information Bottleneck: Learning Minimal Sufficient Unimodal and Multimodal Representations

From arXiv. This paper has been accepted by IEEE Transactions on Multimedia. This version corrects some mistakes and typos in the original paper. The appendix is available at https://github.com/TmacMai/Multimodal-Information-Bottleneck/blob/main/appendix.pdf

Learning effective joint embeddings for cross-modal data has long been a focus in the field of multimodal machine learning. We argue that during multimodal fusion, the generated multimodal embedding may be redundant, and the discriminative unimodal information may be ignored, which often interferes with accurate prediction and leads to a higher risk of overfitting. Moreover, unimodal representations also contain noisy information that negatively influences the learning of cross-modal dynamics. To this end, we introduce the multimodal information bottleneck (MIB), which aims to learn a powerful and sufficient multimodal representation that is free of redundancy and to filter out noisy information in unimodal representations. Specifically, inheriting from the general information bottleneck (IB), MIB aims to learn the minimal sufficient representation for a given task by maximizing the mutual information between the representation and the target while constraining the mutual information between the representation and the input data. Unlike the general IB, MIB regularizes both the multimodal and unimodal representations, yielding a comprehensive and flexible framework that is compatible with any fusion method. We develop three MIB variants, namely early-fusion MIB, late-fusion MIB, and complete MIB, which focus on different perspectives of information constraints. Experimental results suggest that the proposed method reaches state-of-the-art performance on the tasks of multimodal sentiment analysis and multimodal emotion recognition across three widely used datasets. The code is available at https://github.com/TmacMai/Multimodal-Information-Bottleneck.
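To make the objective described in the abstract more concrete, the following is a minimal sketch of an IB-style loss that regularizes both a fused representation and each unimodal representation. It is not taken from the authors' repository. It assumes a common variational approximation in which each encoder outputs the mean and log-variance of a diagonal Gaussian, the compression term is the KL divergence to a standard normal prior, and the task term is a squared error for a scalar regression target. The function names, the `beta` weight, and the choice to sum all KL terms with equal weight are hypothetical choices made for illustration.

```python
import math


def gaussian_kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar)
    )


def mib_style_loss(prediction, target, fused_stats, unimodal_stats, beta=1e-3):
    """Hypothetical IB-style loss over multimodal and unimodal representations.

    prediction, target: scalars (assumed regression task).
    fused_stats: (mu, logvar) of the fused multimodal representation.
    unimodal_stats: list of (mu, logvar) pairs, one per modality.
    beta: assumed trade-off weight between prediction and compression.
    """
    # Task term: a squared error stands in for maximizing the mutual
    # information between the representation and the target.
    task_term = (prediction - target) ** 2

    # Compression terms: KL to the prior upper-bounds the mutual information
    # between each representation and its input (an assumption of this sketch).
    compression = gaussian_kl_to_standard_normal(*fused_stats)
    for mu, logvar in unimodal_stats:
        compression += gaussian_kl_to_standard_normal(mu, logvar)

    return task_term + beta * compression


if __name__ == "__main__":
    fused = ([1.0, 0.0], [0.0, 0.0])
    unimodal = [([0.0], [0.0]), ([2.0], [0.0])]
    print(mib_style_loss(0.5, 1.0, fused, unimodal, beta=0.1))
```

Passing an empty `unimodal_stats` list constrains only the fused representation, while a larger `beta` trades prediction accuracy for stronger compression. How the paper's three variants distribute these constraints is described in the paper and appendix, not in this sketch.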
