Multimodal sentiment analysis (MSA) has extensive real-world applications, but missing modalities in real-world environments require models to be made robust, which typically demands significant manual effort. Multimodal neural architecture search (MNAS) offers a more efficient alternative. However, current MNAS methods, while effective at integrating multi-level information, cannot simultaneously search for optimal operations to extract modality-specific information. This weakens model robustness across diverse scenarios. Moreover, these methods also fall short in capturing emotional cues. In this paper, we propose the robust-sentiment multimodal neural architecture search (RMNAS) framework. Specifically, we adopt the Transformer as a unified architecture for all modalities and incorporate a search over token mixers to enhance the encoding capacity of individual modalities and improve robustness across diverse scenarios. We then leverage BM-NAS to integrate multi-level information. Furthermore, we incorporate local sentiment variation trends to guide the token mixers' computation, enhancing the model's ability to capture sentiment context. Experimental results demonstrate that our approach outperforms or competitively matches existing state-of-the-art approaches to incomplete multimodal learning, on both sentence-level and dialogue-level MSA tasks, without requiring prior knowledge of incomplete learning.
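The abstract does not specify how the token-mixer search is implemented. As a rough, hedged illustration only, the sketch below shows one common way such a search can be realized: a DARTS-style mixed operation that softly blends candidate token mixers inside a Transformer block via learnable architecture weights. The candidate set (self-attention, pooling, depthwise convolution) and all class names are illustrative assumptions, not the authors' actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionMixer(nn.Module):
    """Standard multi-head self-attention token mixer."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        out, _ = self.attn(x, x, x)
        return out

class PoolingMixer(nn.Module):
    """Average-pooling token mixer (in the spirit of PoolFormer)."""
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.pool = nn.AvgPool1d(kernel_size, stride=1, padding=kernel_size // 2)

    def forward(self, x):
        # x: (batch, tokens, dim) -> pool along the token axis
        return self.pool(x.transpose(1, 2)).transpose(1, 2)

class ConvMixer(nn.Module):
    """Depthwise-convolution token mixer."""
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size,
                              padding=kernel_size // 2, groups=dim)

    def forward(self, x):
        return self.conv(x.transpose(1, 2)).transpose(1, 2)

class SearchableTokenMixer(nn.Module):
    """DARTS-style mixed operation: a softmax over learnable architecture
    weights blends the candidate mixers during search; after search, the
    argmax candidate would be kept for each modality branch (assumed)."""
    def __init__(self, dim):
        super().__init__()
        self.candidates = nn.ModuleList([
            SelfAttentionMixer(dim),
            PoolingMixer(dim),
            ConvMixer(dim),
        ])
        # One architecture weight per candidate operation.
        self.alpha = nn.Parameter(torch.zeros(len(self.candidates)))

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.candidates))

# Usage: one searchable mixer per modality branch (e.g., text features).
x_text = torch.randn(2, 20, 64)       # (batch, tokens, dim)
mixer = SearchableTokenMixer(dim=64)
print(mixer(x_text).shape)            # torch.Size([2, 20, 64])
```

Under this reading, each modality's Transformer branch learns its own architecture weights, so different modalities can end up with different token mixers, which is one plausible mechanism for the modality-specific encoding the abstract describes.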