Multimodal classification is a core task in human-centric machine learning. We observe that information is highly complementary across modalities, so unimodal information can be drastically sparsified prior to multimodal fusion without loss of accuracy. To this end, we present Sparse Fusion Transformers (SFT), a novel multimodal fusion method for transformers that performs comparably to existing state-of-the-art methods while having a greatly reduced memory footprint and computation cost. Key to our idea is a sparse-pooling block that reduces unimodal token sets prior to cross-modality modeling. Evaluations are conducted on multiple multimodal benchmark datasets spanning a wide range of classification tasks. State-of-the-art performance is obtained on multiple benchmarks under similar experimental conditions, with up to a six-fold reduction in computational cost and memory requirements. Extensive ablation studies showcase the benefits of combining sparsification with multimodal learning over naive approaches. This paves the way for multimodal learning on low-resource devices.
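To make the core idea concrete, here is a minimal sketch of sparse-pooling unimodal token sets before fusion. This is not the paper's actual SFT block: the saliency score (token L2 norm), the per-modality budget `k`, and the token shapes are all illustrative assumptions; in practice the pooling would be learned and the fused tokens fed to a cross-modal transformer.

```python
import numpy as np

def sparse_pool(tokens, k):
    """Keep the k most salient tokens from one modality.

    Saliency here is the token's L2 norm -- a hypothetical stand-in
    for a learned scoring function.
    """
    scores = np.linalg.norm(tokens, axis=1)
    idx = np.argsort(scores)[-k:]          # indices of the top-k tokens
    return tokens[np.sort(idx)]            # preserve original token order

def sparse_fuse(modalities, k):
    """Sparsify each unimodal token set, then concatenate for fusion."""
    return np.concatenate([sparse_pool(t, k) for t in modalities], axis=0)

rng = np.random.default_rng(0)
vision = rng.normal(size=(196, 64))   # e.g. ViT patch tokens
audio = rng.normal(size=(500, 64))    # e.g. spectrogram frame tokens

fused_in = sparse_fuse([vision, audio], k=8)
print(fused_in.shape)  # 16 tokens total instead of 696
```

Because cross-attention cost grows quadratically with sequence length, shrinking 696 tokens to 16 before fusion is where the claimed compute and memory savings would come from.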