Multimodal learning from document data has achieved great success lately, as it allows semantically meaningful features to be pre-trained and injected as a prior into learnable downstream approaches. In this paper, we approach the document classification problem by learning cross-modal representations through language and vision cues, considering both intra- and inter-modality relationships. Instead of merging features from different modalities into a common representation space, the proposed method exploits high-level interactions and learns relevant semantic information from effective attention flows within and across modalities. The proposed learning objective combines intra- and inter-modality alignment tasks, where the similarity distribution per task is computed by contracting positive sample pairs while simultaneously contrasting negative ones in the common feature representation space. Extensive experiments on public document classification datasets demonstrate the effectiveness and generalization capacity of our model on both small-scale and large-scale datasets.
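For concreteness, a minimal sketch of one such alignment objective, assuming a standard InfoNCE-style contrastive formulation (the symbols $z_i$, $z_j$, temperature $\tau$, and batch size $N$ are illustrative assumptions, not taken from the abstract): a positive pair is contracted while in-batch negatives are contrasted, and the same form can be instantiated once within a modality (intra) and once across modalities (inter):
\begin{equation}
\mathcal{L}_{\text{align}} = -\log \frac{\exp\big(\mathrm{sim}(z_i, z_j)/\tau\big)}{\sum_{k=1,\, k \neq i}^{N} \exp\big(\mathrm{sim}(z_i, z_k)/\tau\big)},
\end{equation}
where $\mathrm{sim}(\cdot,\cdot)$ denotes a similarity measure (e.g., cosine similarity) in the common feature representation space.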