Product-oriented Machine Translation with Cross-modal Cross-lingual Pre-training

Translating e-commerce product descriptions, also known as product-oriented machine translation (PMT), is essential for serving online shoppers all over the world. However, because the domain is so specialized, the PMT task is more challenging than traditional machine translation problems. First, product descriptions contain many specialized jargon terms, which are ambiguous to translate without the product image. Second, product descriptions relate to the image in more complicated ways than standard image descriptions do, involving various visual aspects such as objects, shapes, colors, or even subjective styles. Moreover, existing PMT datasets are too small in scale to support the research. In this paper, we first construct a large-scale bilingual product description dataset called Fashion-MMT, which contains over 114k noisy and 40k manually cleaned description translations with multiple product images. To effectively learn semantic alignments among product images and bilingual texts in translation, we design a unified product-oriented cross-modal cross-lingual model (UPOC²) for pre-training and fine-tuning. Experiments on the Fashion-MMT and Multi30k datasets show that our model significantly outperforms state-of-the-art models, even when they are pre-trained on the same dataset. It is also shown to benefit more from large-scale noisy data in improving translation quality. We will release the dataset and code at https://github.com/syuqings/Fashion-MMT.

