CMA-CLIP: Cross-Modality Attention CLIP for Image-Text Classification

Modern Web systems such as social media and e-commerce contain rich content expressed in images and text. Leveraging information from multiple modalities can improve the performance of machine learning tasks such as classification and recommendation. In this paper, we propose the Cross-Modality Attention Contrastive Language-Image Pre-training (CMA-CLIP), a new framework which unifies two types of cross-modality attention, sequence-wise attention and modality-wise attention, to effectively fuse information from image and text pairs. The sequence-wise attention enables the framework to capture the fine-grained relationship between image patches and text tokens, while the modality-wise attention weighs each modality by its relevance to the downstream tasks. In addition, by adding task-specific modality-wise attentions and multilayer perceptrons, our proposed framework is capable of performing multi-task classification over multiple modalities. We conduct experiments on a Major Retail Website Product Attribute (MRWPA) dataset and two public datasets, Food101 and Fashion-Gen. The results show that CMA-CLIP outperforms the pre-trained and fine-tuned CLIP by an average of 11.9% in recall at the same level of precision on the MRWPA dataset for multi-task classification. It also surpasses the state-of-the-art method on the Fashion-Gen dataset by 5.5% in accuracy and achieves competitive performance on the Food101 dataset. Through detailed ablation studies, we further demonstrate the effectiveness of both cross-modality attention modules and our method's robustness against noise in image and text inputs, which is a common challenge in practice.
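To make the idea of modality-wise attention concrete, the following is a minimal sketch of weighting an image embedding and a text embedding by a softmax over relevance scores. It is an illustration only and not the formulation used in the paper: the function name `modality_wise_attention`, the task-specific `query` vector (which would be learned in practice), and the toy two-dimensional embeddings are all assumptions introduced here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def modality_wise_attention(image_emb, text_emb, query):
    """Fuse two modality embeddings using softmax relevance weights.

    `query` is a hypothetical task-specific vector (an assumption of this
    sketch); each modality is scored by its dot product with `query`.
    """
    scores = [dot(query, image_emb), dot(query, text_emb)]
    weights = softmax(scores)
    fused = [
        weights[0] * i + weights[1] * t
        for i, t in zip(image_emb, text_emb)
    ]
    return fused, weights


# Toy inputs: the query is aligned with the image embedding, so the image
# modality should receive the larger weight.
image_emb = [1.0, 0.0]
text_emb = [0.0, 1.0]
query = [2.0, 0.0]
fused, weights = modality_wise_attention(image_emb, text_emb, query)
print(weights, fused)
```

In a multi-task setting like the one the abstract describes, one such query (and one classifier head) per task would let each task weigh the modalities differently; when one input is noisy or irrelevant for a task, its weight can shrink toward zero.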


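Related content

Attention mechanism

The attention mechanism was first proposed in the computer vision field, but it really took off with the Google DeepMind paper "Recurrent Models of Visual Attention" [14], which applied attention to an RNN model for image classification. Bahdanau et al. then used an attention-like mechanism in "Neural Machine Translation by Jointly Learning to Align and Translate" [1] to perform translation and alignment jointly on a machine translation task, arguably the first work to apply the attention mechanism to NLP. Similar attention-based extensions of RNN models were subsequently applied to a wide range of NLP tasks. More recently, how to use attention in CNNs has also become an active research topic. The figure below shows the rough trend of attention research.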
