Multimodal conditionality in transformer-based natural language models has demonstrated state-of-the-art performance in the task of product description generation. Recent approaches condition a language model on one or more images and other textual metadata to achieve near-human performance for describing products from e-commerce stores. However, generated descriptions may be inaccurate or even contradict the inputs of a given product. In this paper, we propose a controllable language generation framework called Extract-Finetune-Boost (XFBoost), which addresses the problem of inaccurate, low-quality inference. By using visual semantic attributes as constraints at the decoding stage of the generation process and finetuning the language model with policy gradient techniques, the XFBoost framework is found to produce significantly more descriptive text with higher image relevance, outperforming baselines and lowering the frequency of factually inaccurate descriptions. We further demonstrate the application of XFBoost to online learning, wherein human-in-the-loop critics improve language models with active feedback.