The task of multimodal learning has seen growing interest recently, as it allows training neural architectures on different modalities such as vision, text, and audio. One challenge in training such models is that they must jointly learn semantic concepts and their relationships across different input representations. Capsule networks have been shown to perform well at capturing the relation between low-level input features and higher-level concepts. However, capsules have so far mainly been used in small-scale, fully supervised settings due to the resource demands of conventional routing algorithms. We present a new multimodal capsule network that leverages the strength of capsules within a multimodal learning framework on large amounts of video data. To adapt capsules to large-scale input data, we propose a novel routing-by-self-attention mechanism that selects relevant capsules, which are then used to generate a final joint multimodal feature representation. This not only allows for robust training with noisy video data, but also scales up the size of the capsule network compared to traditional routing methods while remaining computationally efficient. We evaluate the proposed architecture by pretraining it on a large-scale multimodal video dataset and applying it to four datasets across two challenging downstream tasks. Results show that the proposed multimodal capsule network not only improves results compared to other routing techniques, but also achieves competitive performance on the task of multimodal learning.
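The routing-by-self-attention idea described above can be illustrated with a minimal sketch: capsule activations attend to one another via scaled dot-product self-attention, and the re-weighted capsules are pooled into a single joint feature vector. This is a hypothetical simplification for intuition only, not the authors' exact architecture; the function name, the use of NumPy, and the mean-pooling step are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_routing(capsules):
    """Toy routing of capsules via scaled dot-product self-attention.

    capsules: (n_capsules, dim) array of capsule activations
              (e.g. from video, text, and audio branches).
    Returns a single joint feature vector of shape (dim,).
    """
    n, d = capsules.shape
    # Pairwise attention scores; queries and keys are the capsules themselves.
    scores = capsules @ capsules.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)   # (n, n) soft routing weights
    attended = weights @ capsules        # (n, d) re-weighted capsules
    # Pool the attended capsules into one joint multimodal representation.
    return attended.mean(axis=0)

# Toy usage: 4 capsules of dimension 8.
rng = np.random.default_rng(0)
caps = rng.standard_normal((4, 8))
joint = self_attention_routing(caps)
print(joint.shape)  # (8,)
```

Unlike iterative agreement-based routing, a single attention pass like this has a fixed cost per forward step, which is what makes scaling to large, noisy video corpora plausible.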