There is increasing interest in the use of multimodal data in various web applications, such as digital advertising and e-commerce. Typical methods for extracting important information from multimodal data rely on a mid-fusion architecture that combines the feature representations from multiple encoders. However, as the number of modalities increases, several potential problems with the mid-fusion model structure arise, such as an increase in the dimensionality of the concatenated multimodal features and vulnerability to missing modalities. To address these problems, we propose a new concept that considers multimodal inputs as a set of sequences, namely, deep multimodal sequence sets (DM$^2$S$^2$). Our set-aware concept consists of three components that capture the relationships among multiple modalities: (a) a BERT-based encoder to handle the inter- and intra-order of elements in the sequences, (b) intra-modality residual attention (IntraMRA) to capture the importance of the elements in a modality, and (c) inter-modality residual attention (InterMRA) to further enhance the importance of elements with modality-level granularity. Our concept exhibits performance that is comparable to or better than that of previous set-aware models. Furthermore, we demonstrate that visualizing the learned InterMRA and IntraMRA weights can provide an interpretation of the prediction results.
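To make the residual-attention idea behind IntraMRA concrete, the following is a minimal NumPy sketch. It scores each element of a single modality's sequence with a small attention network and adds the attention-weighted features back to the input via a residual connection. The weight shapes, the `tanh`-based scoring function, and all variable names here are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def residual_attention(X, W, v):
    """Residual attention over a set of element embeddings (sketch).

    X: (n_elements, d) embeddings for one modality.
    W: (d, d) projection, v: (d,) scoring vector -- assumed shapes.
    Each element gets a scalar importance score; the attention-weighted
    features are added back to the input (the "residual" in IntraMRA).
    """
    scores = softmax(np.tanh(X @ W) @ v)   # (n_elements,) importance weights
    return X + scores[:, None] * X         # residual re-weighting

rng = np.random.default_rng(0)
d = 8
X = rng.normal(size=(5, d))   # 5 elements in one hypothetical modality
W = rng.normal(size=(d, d))
v = rng.normal(size=(d,))
out = residual_attention(X, W, v)
assert out.shape == X.shape
```

InterMRA would apply the same pattern one level up, scoring pooled modality-level representations instead of the elements within a single modality.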