We propose a compact and effective framework for fusing multimodal features at multiple layers of a single network. The framework consists of two innovative fusion schemes. First, unlike existing multimodal methods that require separate encoders for different modalities, we show that multimodal features can be learned within a single shared network simply by maintaining modality-specific batch normalization layers in the encoder, which also enables implicit fusion through joint feature representation learning. Second, we propose a bidirectional multi-layer fusion scheme in which multimodal features are exploited progressively. To take full advantage of this scheme, we introduce two asymmetric fusion operations, channel shuffle and pixel shift, which produce different fused features depending on the fusion direction. Both operations are parameter-free; they strengthen multimodal feature interactions across channels and enhance spatial feature discrimination within channels. We conduct extensive experiments on semantic segmentation and image translation tasks using three publicly available datasets covering diverse modalities. The results indicate that our framework is general, compact, and superior to state-of-the-art fusion frameworks.
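To make the three ingredients named above concrete, the sketch below gives one plausible PyTorch reading of them: modality-specific batch normalization inside a shared encoder, a parameter-free channel-shuffle fusion, and a parameter-free pixel-shift fusion. This is only an illustrative interpretation of the abstract, not the authors' implementation; the names `ModalitySpecificBN`, `channel_shuffle_fuse`, and `pixel_shift_fuse`, the group/shift settings, and the choice of PyTorch are all assumptions.

```python
# Minimal sketch (assumed, not the authors' code) of the ideas described above.
import torch
import torch.nn as nn


class ModalitySpecificBN(nn.Module):
    """Shared encoder weights can be reused across modalities by keeping
    only the batch-norm statistics/affine parameters modality-specific."""

    def __init__(self, channels: int, num_modalities: int = 2):
        super().__init__()
        self.bns = nn.ModuleList(
            nn.BatchNorm2d(channels) for _ in range(num_modalities)
        )

    def forward(self, x: torch.Tensor, modality: int) -> torch.Tensor:
        # Select the BN branch that matches the input modality.
        return self.bns[modality](x)


def channel_shuffle_fuse(a: torch.Tensor, b: torch.Tensor):
    """Parameter-free cross-channel fusion: interleave the channels of the
    two modal features (ShuffleNet-style shuffle with two groups), then split
    back so each branch carries channels from both modalities."""
    n, c, h, w = a.shape
    x = torch.cat([a, b], dim=1)                       # (N, 2C, H, W)
    x = x.view(n, 2, c, h, w).transpose(1, 2)          # (N, C, 2, H, W)
    x = x.reshape(n, 2 * c, h, w)                      # channels interleaved
    return x[:, :c], x[:, c:]


def pixel_shift_fuse(a: torch.Tensor, b: torch.Tensor, shift: int = 1):
    """Parameter-free within-channel fusion: spatially shift one feature map
    before averaging, so fused activations see neighboring positions."""
    b_shifted = torch.roll(b, shifts=(shift, shift), dims=(2, 3))
    return 0.5 * (a + b_shifted)


if __name__ == "__main__":
    # Toy usage with two 64-channel feature maps standing in for two modalities.
    bn = ModalitySpecificBN(channels=64, num_modalities=2)
    rgb, depth = torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32)
    rgb, depth = bn(rgb, modality=0), bn(depth, modality=1)
    rgb_fused, depth_fused = channel_shuffle_fuse(rgb, depth)
    out = pixel_shift_fuse(rgb_fused, depth_fused)
    print(out.shape)  # torch.Size([2, 64, 32, 32])
```

Because both fusion operations are pure tensor rearrangements, they add no parameters to the shared encoder, which is consistent with the compactness claim; how the two operations are assigned to the two fusion directions is left open here.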