With the prosperity of the e-commerce industry, various modalities, e.g., vision and language, are utilized to describe product items. Understanding such diversified data is an enormous challenge, especially when extracting attribute-value pairs from text sequences with the aid of helpful image regions. Although a series of previous works has been dedicated to this task, several seldom-investigated obstacles still hinder further improvement: 1) Parameters from upstream single-modal pretraining are inadequately applied, without proper joint fine-tuning on the downstream multi-modal task. 2) To select descriptive parts of images, simple late fusion is widely applied, ignoring the prior knowledge that language-related information should be encoded into a common linguistic embedding space by stronger encoders. 3) Owing to the diversity across products, attribute sets vary greatly, yet current approaches predict over an unnecessarily maximal attribute range, leading to more potential false positives. To address these issues, we propose a novel approach that boosts multi-modal e-commerce attribute value extraction via a unified learning scheme and dynamic range minimization: 1) First, a unified scheme is designed to jointly train the multi-modal task with pretrained single-modal parameters. 2) Second, a text-guided information range minimization method is proposed to adaptively encode descriptive parts of each modality into an identical space with a powerful pretrained linguistic model. 3) Moreover, a prototype-guided attribute range minimization method is proposed to first determine the proper attribute set of the current product and then select prototypes to guide the prediction of the chosen attributes. Experiments on popular multi-modal e-commerce benchmarks show that our approach achieves superior performance over other state-of-the-art techniques.
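To make contribution 1) concrete, the following is a minimal sketch of one common way to realize such a unified scheme: all pretrained single-modal parameters are fine-tuned jointly with the newly added multi-modal layers, with a smaller learning rate on the pretrained groups. The module names `text_encoder` and `image_encoder` and the learning rates are hypothetical; the abstract does not specify the exact schedule.

```python
import torch

def build_optimizer(model, lr_pretrained=1e-5, lr_new=1e-4):
    """Jointly fine-tune pretrained single-modal encoders and new
    multi-modal layers, using a smaller LR for pretrained weights."""
    pretrained, new = [], []
    for name, param in model.named_parameters():
        # 'text_encoder' / 'image_encoder' are hypothetical module names
        # for the upstream single-modal pretrained components.
        if name.startswith(('text_encoder', 'image_encoder')):
            pretrained.append(param)
        else:
            new.append(param)
    return torch.optim.AdamW([
        {'params': pretrained, 'lr': lr_pretrained},
        {'params': new, 'lr': lr_new},
    ])
```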
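For contribution 2), a plausible reading is that text tokens from the pretrained linguistic model query image regions via cross-attention, after the regions are projected into the linguistic embedding space, so only the descriptive regions contribute. The sketch below illustrates that idea under assumed dimensions (e.g., 2048-d region features from a detection backbone); it is not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class TextGuidedRegionFusion(nn.Module):
    """Projects image region features into the linguistic embedding space
    and lets text tokens attend to the descriptive regions."""
    def __init__(self, d_text=768, d_region=2048, n_heads=8):
        super().__init__()
        # Map visual region features into the text encoder's space.
        self.region_proj = nn.Linear(d_region, d_text)
        # Text tokens act as queries; projected regions as keys/values.
        self.cross_attn = nn.MultiheadAttention(d_text, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_text)

    def forward(self, text_hidden, region_feats):
        # text_hidden: (B, L, d_text) from the pretrained language model
        # region_feats: (B, R, d_region) from a visual backbone
        regions = self.region_proj(region_feats)              # (B, R, d_text)
        fused, attn = self.cross_attn(text_hidden, regions, regions)
        # Residual connection keeps the linguistic representation dominant.
        return self.norm(text_hidden + fused), attn
```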
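Contribution 3) suggests a two-stage pipeline: a product-level multi-label gate first minimizes the attribute range, then learned per-attribute prototypes condition value prediction for the selected attributes only. The sketch below, with a hypothetical BIO-style tagging head and an assumed gating threshold, shows one way to wire the two stages.

```python
import torch
import torch.nn as nn

class PrototypeGuidedTagger(nn.Module):
    def __init__(self, d_model=768, n_attrs=100, n_tags=3):
        super().__init__()
        # One learned prototype vector per attribute.
        self.prototypes = nn.Parameter(torch.randn(n_attrs, d_model))
        self.attr_gate = nn.Linear(d_model, n_attrs)   # stage 1: multi-label gate
        self.tagger = nn.Linear(2 * d_model, n_tags)   # stage 2: e.g. BIO value tags

    def forward(self, token_hidden, pooled, threshold=0.5):
        # pooled: (B, d) product representation; token_hidden: (B, L, d)
        attr_probs = torch.sigmoid(self.attr_gate(pooled))  # (B, A)
        active = attr_probs > threshold                     # minimized attribute range
        B, L, d = token_hidden.shape
        A = self.prototypes.shape[0]
        # Condition every token on every attribute prototype.
        protos = self.prototypes.view(1, A, 1, d).expand(B, A, L, d)
        tokens = token_hidden.unsqueeze(1).expand(B, A, L, d)
        logits = self.tagger(torch.cat([tokens, protos], dim=-1))  # (B, A, L, n_tags)
        # Values are decoded only for attributes in 'active', so attributes
        # outside the minimized range cannot yield false positives.
        return attr_probs, logits, active
```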