Multimodal pre-training breaks down modality barriers and allows individual modalities to augment one another with complementary information, yielding significant advances in representation learning. However, the graph modality, despite being a general and important form of data, cannot easily interact with other modalities because of its non-regular structure. In this paper, we propose MMGA (Multimodal learning with Graph Alignment), a novel multimodal pre-training framework that incorporates information from the graph (social network), image, and text modalities on social media to enhance user representation learning. In MMGA, a multi-step graph alignment mechanism adds self-supervision from the graph modality to optimize the image and text encoders, while information from the image and text modalities in turn guides the learning of the graph encoder. We conduct experiments on a dataset crawled from Instagram. The experimental results show that MMGA performs well on this dataset and improves performance on the fan prediction task. We release our dataset, the first multimodal social media dataset with a graph, comprising 60,000 users labeled with specific topics based on 2 million posts, to facilitate future research.