With the prosperity of the e-commerce industry, various modalities, e.g., vision and language, are utilized to describe product items. Understanding such diversified data is an enormous challenge, especially when extracting attribute-value pairs from text sequences with the aid of helpful image regions. Although a series of previous works have been dedicated to this task, several seldom-investigated obstacles still hinder further improvement: 1) Parameters from upstream single-modal pretraining are inadequately exploited, without proper joint fine-tuning on the downstream multi-modal task. 2) To select descriptive parts of images, simple late fusion is widely applied, ignoring the prior knowledge that language-related information should be encoded into a common linguistic embedding space by stronger encoders. 3) Because products are diverse, their attribute sets vary greatly, yet current approaches predict over an unnecessarily maximal attribute range, leading to more potential false positives. To address these issues, we propose in this paper a novel approach that boosts multi-modal e-commerce attribute value extraction via a unified learning scheme and dynamic range minimization: 1) Firstly, a unified scheme is designed to jointly train the multi-modal task with pretrained single-modal parameters. 2) Secondly, a text-guided information range minimization method is proposed to adaptively encode descriptive parts of each modality into an identical space with a powerful pretrained linguistic model. 3) Moreover, a prototype-guided attribute range minimization method is proposed to first determine the proper attribute set of the current product, and then select prototypes to guide the prediction of the chosen attributes. Experiments on popular multi-modal e-commerce benchmarks show that our approach achieves superior performance over other state-of-the-art techniques.
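To make the second and third ideas above concrete, the following PyTorch sketch illustrates one plausible realization: text tokens attend over image regions so that only language-relevant visual content is fused into the linguistic embedding space, and a two-stage head first gates the attribute set before predicting values guided by attribute prototypes. All module names, shapes, and the gating threshold are illustrative assumptions on our part, not the paper's actual implementation.

```python
# Minimal sketch of text-guided information range minimization and
# prototype-guided attribute range minimization. Hypothetical design;
# the paper's architecture may differ.
import torch
import torch.nn as nn


class TextGuidedRegionSelector(nn.Module):
    """Cross-attention from text tokens to image regions, so that only
    language-relevant visual information is projected into the shared
    linguistic embedding space (idea 2 above)."""

    def __init__(self, text_dim: int, region_dim: int, num_heads: int = 8):
        super().__init__()
        self.region_proj = nn.Linear(region_dim, text_dim)  # map regions into text space
        self.attn = nn.MultiheadAttention(text_dim, num_heads, batch_first=True)

    def forward(self, text_tokens, region_feats):
        # text_tokens: (B, T, text_dim); region_feats: (B, R, region_dim)
        regions = self.region_proj(region_feats)
        # Text queries attend over image regions: descriptive regions
        # receive higher attention weights, the rest are suppressed.
        fused, _ = self.attn(query=text_tokens, key=regions, value=regions)
        return text_tokens + fused  # residual fusion in the linguistic space


class PrototypeGuidedPredictor(nn.Module):
    """Two-stage prediction (idea 3 above): first decide which attributes
    apply to the current product, then predict values only for that reduced
    set, conditioned on learned attribute prototypes."""

    def __init__(self, dim: int, num_attrs: int, num_values: int):
        super().__init__()
        self.attr_gate = nn.Linear(dim, num_attrs)           # attribute-set classifier
        self.prototypes = nn.Parameter(torch.randn(num_attrs, dim))
        self.value_head = nn.Linear(dim, num_values)

    def forward(self, pooled, threshold: float = 0.5):
        # pooled: (B, dim) fused multi-modal product representation
        attr_probs = torch.sigmoid(self.attr_gate(pooled))    # (B, num_attrs)
        active = attr_probs > threshold                       # dynamic attribute range
        # Condition value prediction on each attribute's prototype.
        cond = pooled.unsqueeze(1) + self.prototypes.unsqueeze(0)  # (B, A, dim)
        value_logits = self.value_head(cond)                  # (B, A, num_values)
        # Mask attributes judged irrelevant, reducing false positives.
        value_logits = value_logits.masked_fill(~active.unsqueeze(-1), float("-inf"))
        return attr_probs, value_logits
```

The key design point this sketch captures is that the attribute gate shrinks the prediction range per product before any value is emitted, rather than scoring every attribute in the full vocabulary, which is what the abstract identifies as the source of extra false positives.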