Recently, online shopping has gradually become a common way of shopping for people all over the world. Well-crafted merchandise advertisements often attract more people to buy. These advertisements properly integrate multimodal, multi-structured information about commodities, such as visual spatial information and fine-grained structural information. However, traditional multimodal text generation focuses on conventional descriptions of what exists and what happens, which does not match the requirements of real-world advertisement copywriting, since advertisement copywriting demands a vivid language style and higher faithfulness. Unfortunately, reusable evaluation frameworks are lacking and datasets are scarce. Therefore, we present a dataset, E-MMAD (e-commercial multimodal multi-structured advertisement copywriting), which both requires and supports much more detailed information in text generation. Notably, it is one of the largest video captioning datasets in this field. Accordingly, we propose a baseline method and a faithfulness evaluation metric built on structured information reasoning to meet this real-world demand on our dataset. Our method surpasses previous methods by a large margin on all metrics. The dataset and method will be released at \url{https://e-mmad.github.io/e-mmad.net/index.html}.