Modern Web systems such as social media and e-commerce contain rich content expressed in images and text. Leveraging information from multiple modalities can improve the performance of machine learning tasks such as classification and recommendation. In this paper, we propose Cross-Modality Attention Contrastive Language-Image Pre-training (CMA-CLIP), a new framework which unifies two types of cross-modality attention, sequence-wise attention and modality-wise attention, to effectively fuse information from image and text pairs. The sequence-wise attention enables the framework to capture the fine-grained relationship between image patches and text tokens, while the modality-wise attention weighs each modality by its relevance to the downstream tasks. In addition, by adding task-specific modality-wise attentions and multilayer perceptrons, our proposed framework is capable of performing multi-task classification with multiple modalities. We conduct experiments on a Major Retail Website Product Attribute (MRWPA) dataset and two public datasets, Food101 and Fashion-Gen. The results show that CMA-CLIP outperforms the pre-trained and fine-tuned CLIP by an average of 11.9% in recall at the same level of precision on the MRWPA dataset for multi-task classification. It also surpasses the state-of-the-art method on the Fashion-Gen dataset by 5.5% in accuracy and achieves competitive performance on the Food101 dataset. Through detailed ablation studies, we further demonstrate the effectiveness of both cross-modality attention modules and our method's robustness against noise in image and text inputs, which is a common challenge in practice.
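To make the modality-wise attention idea concrete, the following is a minimal numpy sketch, not the paper's actual implementation: each modality embedding is scored for relevance by a (here randomly initialized, hypothetical) learned projection, the scores are normalized with a softmax, and the fused representation is the resulting convex combination of the two modality embeddings.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def modality_wise_fusion(img_emb, txt_emb, w_img, w_txt):
    """Weigh each modality by a scalar relevance score, then fuse.

    img_emb, txt_emb : modality embeddings (shape [d])
    w_img, w_txt     : hypothetical learned relevance projections (shape [d])
    Returns the fused embedding and the per-modality attention weights.
    """
    scores = np.array([img_emb @ w_img, txt_emb @ w_txt])
    weights = softmax(scores)  # relevance of image vs. text, sums to 1
    fused = weights[0] * img_emb + weights[1] * txt_emb
    return fused, weights

# Toy usage with random embeddings and projections.
rng = np.random.default_rng(0)
d = 8
img = rng.normal(size=d)
txt = rng.normal(size=d)
fused, w = modality_wise_fusion(img, txt, rng.normal(size=d), rng.normal(size=d))
```

A noisy or irrelevant modality receives a low weight and therefore contributes little to the fused representation, which is one intuition behind the robustness results reported in the ablation studies.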