In this work, we present a multimodal model for commercial product classification that combines features extracted by multiple neural network models from textual data (CamemBERT and FlauBERT) and visual data (SE-ResNeXt-50), using simple fusion techniques. The proposed method significantly outperformed the unimodal models and the reported performance of similar models on our specific task. We experimented with multiple fusion techniques and found that the best-performing technique for combining the individual embeddings of the unimodal networks is based on combining concatenation and averaging of the feature vectors. Each modality compensated for the shortcomings of the other modalities, demonstrating that increasing the number of modalities can be an effective method for improving performance on multi-label and multimodal classification problems.
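To make the idea of fusing by concatenation and averaging concrete, the following is a minimal sketch in plain Python. The abstract does not say which feature vectors are averaged and which are concatenated, so the arrangement here is an assumption for illustration only: the two text embeddings are averaged element-wise (assuming they share one dimension), and the result is concatenated with the image embedding. The function names `average`, `concatenate`, and `fuse` are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence

Vector = List[float]


def average(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("vectors must share one dimension to be averaged")
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def concatenate(vectors: Sequence[Vector]) -> Vector:
    """Join vectors end to end into a single longer vector."""
    return [x for v in vectors for x in v]


def fuse(text_embeddings: Sequence[Vector], image_embedding: Vector) -> Vector:
    """Hypothetical fusion (an assumption, not the paper's confirmed method):
    average the text embeddings, then concatenate with the image embedding."""
    return concatenate([average(text_embeddings), image_embedding])


# Toy vectors standing in for real model outputs.
text_a = [1.0, 3.0]        # e.g. an embedding from one text model
text_b = [3.0, 5.0]        # e.g. an embedding from a second text model
image = [0.5, 0.5, 0.5]    # e.g. an embedding from the image model
fused = fuse([text_a, text_b], image)
```

In this arrangement the fused vector has the text dimension plus the image dimension, and it would then be passed to a classification head. Other combinations of the two operations are equally consistent with the abstract.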