Analyzing Unaligned Multimodal Sequence via Graph Convolution and Graph Pooling Fusion

In this paper, we study multimodal sequence analysis, which aims to draw inferences from visual, language and acoustic sequences. Most existing works focus on aligned fusion of the three modalities, mostly at the word level, which is impractical in real-world scenarios. To overcome this issue, we address multimodal sequence analysis on unaligned modality sequences, a setting that is still relatively underexplored and more challenging. Recurrent neural networks (RNNs) and their variants are widely used in multimodal sequence analysis, but they are susceptible to vanishing/exploding gradients and high time complexity due to their recurrent nature. Therefore, we propose a novel model, termed Multimodal Graph, to investigate the effectiveness of graph neural networks (GNNs) in modeling multimodal sequential data. The graph-based structure enables parallel computation along the time dimension and can learn longer temporal dependencies in long unaligned sequences. Specifically, our Multimodal Graph is hierarchically structured into two stages: intra-modal and inter-modal dynamics learning. In the first stage, a graph convolutional network is employed for each modality to learn intra-modal dynamics. In the second stage, because the multimodal sequences are unaligned, the commonly used word-level fusion does not apply. To this end, we devise a graph pooling fusion network to automatically learn the associations between nodes from different modalities. Additionally, we define multiple ways to construct the adjacency matrix for sequential data. Experimental results suggest that our graph-based model reaches state-of-the-art performance on two benchmark datasets.
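The two-stage idea in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The abstract does not specify how the adjacency matrices are built or how the pooling is learned, so everything below is an assumption made for illustration: the function names, the window-based adjacency, the symmetric normalization, the identity weight matrix, and the plain mean pooling that stands in for the learned graph pooling fusion network.

```python
import math

def window_adjacency(n, k):
    """Hypothetical adjacency for a length-n sequence: link time steps
    that are at most k apart (self-loops included)."""
    return [[1.0 if abs(i - j) <= k else 0.0 for j in range(n)] for i in range(n)]

def normalize(adj):
    """Symmetric normalization D^{-1/2} A D^{-1/2}, as used in standard GCN layers."""
    deg = [sum(row) for row in adj]
    return [[a / math.sqrt(deg[i] * deg[j]) if a else 0.0
             for j, a in enumerate(row)] for i, row in enumerate(adj)]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def graph_conv(adj_norm, feats, weight):
    """One graph convolution layer: ReLU(A_norm @ H @ W)."""
    out = matmul(matmul(adj_norm, feats), weight)
    return [[max(0.0, v) for v in row] for row in out]

def mean_pool(nodes):
    """Average all node embeddings into one vector (a stand-in for learned pooling)."""
    n = len(nodes)
    return [sum(col) / n for col in zip(*nodes)]

# Unaligned sequences: each modality has its own length (toy values).
lengths = {"language": 4, "acoustic": 6, "visual": 5}
weight = [[1.0, 0.0], [0.0, 1.0]]  # placeholder for a learned weight matrix

all_nodes = []
for name, n in lengths.items():
    feats = [[float(t), 1.0] for t in range(n)]       # toy 2-d features per time step
    adj = normalize(window_adjacency(n, k=1))
    all_nodes.extend(graph_conv(adj, feats, weight))  # stage 1: intra-modal dynamics

fused = mean_pool(all_nodes)                          # stage 2: inter-modal fusion
```

Because every modality is processed as its own graph and fusion operates on the union of nodes, the sketch never needs the three sequences to share a length or a word-level alignment, which is the property the abstract emphasizes.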
