A Detailed Look at ICBU's Controllable Text Generation Technology

Overview diagram of controllable text generation techniques

I. Text Generation Techniques

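Text generation is an important and challenging task in natural language processing (NLP). As the name suggests, its goal is to generate text sequences that approximate natural language. Tasks can nonetheless be categorized by the type of input data: data-to-text generation takes structured data as input, image captioning takes images, video summarization takes video, and speech recognition takes audio. This article focuses on text-to-text tasks, where both input and output are text, including neural machine translation, question answering, and abstractive summarization.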

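With the development of deep learning, many emerging techniques have been adopted for text generation. For example, attention mechanisms and copy mechanisms emerged to address long-range dependencies and out-of-vocabulary (OOV) words; on the architecture side, recurrent neural networks, convolutional neural networks, graph neural networks, and Transformers have all been used. With the rise of the "pre-train, fine-tune" paradigm, large pre-trained language models (PLMs), trained in a self-supervised manner on massive corpora, are also widely applied to text generation.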

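To show how these architectures, models, and mechanisms are applied to text generation, the first subsection of this chapter briefly reviews the structures of mainstream text generation models, and the second subsection summarizes evaluation metrics for text generation.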

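1 Structures of Text Generation Models

The structure of a text generation model is often inspired by how humans write. Based on their structural characteristics, mainstream text generation models fall into the following categories:

Figure 1: Diagrams of the various text generation model structures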

Encoder-Decoder Framework

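The encoder-decoder framework first encodes the text with an encoder. A decoder then decodes the output text autoregressively, conditioning on the source encoding and on the partial output decoded so far. This resembles how a person first understands the material (source text, images, video, and so on) and then writes the text word by word, based on that understanding and on what has already been written. It is currently the most widely used framework for sequence-to-sequence tasks.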
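The decoding loop described above can be sketched in a few lines. This is only a minimal illustration, not the implementation of any particular model: `toy_encode` and `toy_decoder_step` are hypothetical stand-ins for a trained encoder and decoder, and greedy selection is assumed for simplicity.

```python
from typing import Callable, List, Sequence

BOS, EOS = "<bos>", "<eos>"


def toy_encode(source: Sequence[str]) -> List[str]:
    """Hypothetical encoder: 'encodes' each source token by uppercasing it."""
    return [token.upper() for token in source]


def toy_decoder_step(memory: Sequence[str], prefix: Sequence[str]) -> str:
    """Hypothetical decoder step: returns the next token given the source
    encoding and the partial output (prefix[0] is always BOS)."""
    produced = len(prefix) - 1
    return memory[produced] if produced < len(memory) else EOS


def greedy_decode(
    source: Sequence[str],
    encode: Callable[[Sequence[str]], List[str]],
    decoder_step: Callable[[Sequence[str], Sequence[str]], str],
    max_len: int = 20,
) -> List[str]:
    memory = encode(source)  # the source is encoded once, up front
    prefix = [BOS]
    for _ in range(max_len):
        # Each step conditions on the source encoding AND the partial output.
        next_token = decoder_step(memory, prefix)
        if next_token == EOS:
            break
        prefix.append(next_token)
    return prefix[1:]  # drop BOS


output = greedy_decode(["red", "shoes"], toy_encode, toy_decoder_step)
```

In a real system the decoder step would return a probability distribution over the vocabulary, and the loop would pick from it by greedy search, beam search, or sampling.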

Auto-regressive Language Model

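A standard left-to-right unidirectional language model can also decode a text sequence token by token from the preceding sequence. This decoding process, which models future states from the preceding context, is called auto-regressive decoding. Unlike the encoder-decoder framework, which uses an encoder for the source text and a decoder for the partially predicted sequence, an auto-regressive language model (AR LM) uses a single model to encode both the source text and the partially decoded sequence.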
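The difference from the encoder-decoder setup is easiest to see in how the input is assembled. The sketch below assumes the source and the output are joined into one sequence with a separator token; `toy_lm_next` is a hypothetical placeholder for a trained language model, and the `<sep>` convention is an illustrative assumption.

```python
from typing import Callable, List, Sequence

SEP, EOS = "<sep>", "<eos>"


def toy_lm_next(context: Sequence[str]) -> str:
    """Hypothetical LM: after <sep>, emits the source tokens in reverse order."""
    sep_at = list(context).index(SEP)
    source, generated = context[:sep_at], context[sep_at + 1:]
    remaining = len(source) - len(generated)
    return source[remaining - 1] if remaining > 0 else EOS


def lm_generate(
    source: Sequence[str],
    lm_next: Callable[[Sequence[str]], str],
    max_len: int = 20,
) -> List[str]:
    # Source text and generated text share one sequence and one model.
    context = list(source) + [SEP]
    start = len(context)
    for _ in range(max_len):
        next_token = lm_next(context)
        if next_token == EOS:
            break
        context.append(next_token)
    return context[start:]


output = lm_generate(["x", "y", "z"], toy_lm_next)
```

There is no separate `encode` call here: the single model re-reads the source and the tokens generated so far at every step.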

Hierarchical Encoder-Decoder

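When reading source material, people first understand individual sentences and then the text as a whole. When writing, they likewise first sketch the general direction of each sentence and then write out the content word by word. Models of this kind typically need a hierarchical encoder that encodes the source text at the intra-sentence and inter-sentence levels, paired with hierarchical decoding at the sentence level and the token level. In the RNN era, hierarchical models, which separately model local and global information at different granularities, often improved performance. In the era of Transformers and pre-trained language models, fully connected self-attention has weakened this advantage.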
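A rough sketch of the two encoding levels follows. Mean pooling is used here purely as an assumed stand-in for a learned intra-sentence or inter-sentence encoder (an RNN or Transformer in practice), and the `embed` lookup table is hypothetical.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average a non-empty list of equal-length vectors, dimension by dimension."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def hierarchical_encode(
    document: List[List[str]], embed: Dict[str, Vector]
) -> Tuple[List[Vector], Vector]:
    # Intra-sentence level: tokens of each sentence -> one sentence vector.
    sentence_vectors = [
        mean_pool([embed[token] for token in sentence]) for sentence in document
    ]
    # Inter-sentence level: sentence vectors -> one document vector.
    document_vector = mean_pool(sentence_vectors)
    return sentence_vectors, document_vector


embed = {"a": [1.0, 0.0], "b": [3.0, 2.0], "c": [0.0, 4.0]}
sentence_vectors, document_vector = hierarchical_encode([["a", "b"], ["c"]], embed)
```

A hierarchical decoder would mirror this: first decide a sentence-level plan from the document vector, then decode the tokens of each sentence.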

Knowledge-Enriched Model

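Knowledge-enriched text generation models bring in external knowledge, so in addition to the text encoder for the source text they usually need a knowledge encoder for the external knowledge. The knowledge encoder can be chosen according to the data structure of that knowledge: graph neural networks for knowledge graphs, convolutional neural networks for images, and pre-trained language models for text. To fuse the source-text encoding with the knowledge encoding, options include attention mechanisms, pointer-generator networks, and memory networks.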
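As one concrete example of such a fusion step, a pointer-generator network mixes the decoder's vocabulary distribution with a copy distribution given by the attention weights over the input tokens: P(w) = p_gen * P_vocab(w) + (1 - p_gen) * (sum of attention on positions holding w). The sketch below shows only this mixing step; the function name and the numbers are illustrative assumptions, and in a real model `p_gen`, `p_vocab`, and `attention` are all predicted by the network.

```python
from collections import defaultdict
from typing import Dict, List


def pointer_generator_mix(
    p_vocab: Dict[str, float],
    attention: List[float],
    source_tokens: List[str],
    p_gen: float,
) -> Dict[str, float]:
    """Mix the 'generate' and 'copy' distributions into one final distribution."""
    final: Dict[str, float] = defaultdict(float)
    # Generation path: scale the vocabulary distribution by p_gen.
    for word, prob in p_vocab.items():
        final[word] += p_gen * prob
    # Copy path: each input position contributes its attention weight,
    # so words outside the vocabulary can still receive probability.
    for token, weight in zip(source_tokens, attention):
        final[token] += (1.0 - p_gen) * weight
    return dict(final)


mixed = pointer_generator_mix(
    p_vocab={"the": 0.5, "cat": 0.5},
    attention=[0.25, 0.75],
    source_tokens=["cat", "Felix"],
    p_gen=0.5,
)
```

Here "Felix" is not in the vocabulary, yet it ends up with non-zero probability through the copy path, which is how the copy mechanism helps with OOV words.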

Write-then-Edit Framework

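Since even human-written drafts are rarely finished in one pass, text generation may likewise need a revision step. When revising, a human writes the final version based on the original material and the draft; the model similarly re-encodes and re-decodes based on the source text and the decoded draft. A model that considers both the source and the draft can produce more reasonable text, at the cost of more computation and lower generation efficiency.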
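The two-pass flow can be sketched as follows. Both passes are hypothetical toy functions: `toy_draft` deliberately produces a flawed draft (stuttering plus one unsupported token), and `toy_edit` revises it using both the source and the draft. Real systems would use trained decoders for both passes.

```python
from typing import Callable, List, Sequence, Tuple


def toy_draft(source: Sequence[str]) -> List[str]:
    """Hypothetical first pass: repeats every token and adds an unsupported one."""
    draft: List[str] = []
    for token in source:
        draft.extend([token, token])
    return draft + ["extra"]


def toy_edit(source: Sequence[str], draft: Sequence[str]) -> List[str]:
    """Hypothetical second pass: sees BOTH the source and the draft."""
    supported = set(source)
    final: List[str] = []
    for token in draft:
        # Keep tokens grounded in the source and drop immediate repetitions.
        if token in supported and (not final or final[-1] != token):
            final.append(token)
    return final


def write_then_edit(
    source: Sequence[str],
    draft_fn: Callable[[Sequence[str]], List[str]],
    edit_fn: Callable[[Sequence[str], Sequence[str]], List[str]],
) -> Tuple[List[str], List[str]]:
    draft = draft_fn(source)        # pass 1: write
    final = edit_fn(source, draft)  # pass 2: edit, conditioned on source + draft
    return draft, final


draft, final = write_then_edit(["red", "shoes"], toy_draft, toy_edit)
```

The second pass is a full extra decoding run, which is where the added computation and latency mentioned above come from.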

Table 1: Text generation model structures and their representative models

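2 Evaluation Metrics for Text Generation

II. Controllable Text Generation

1 Designing Prompts

Figure 3: Commonly used prompts for different tasks; [X] is the source text and [Z] is the answer to be generated

2 Constructing Control Codes

Constructing training data

Designing the loss function

3 Adding a Decoding Strategy

Improving the sampling strategy

Introducing external feedback

4 Write-then-Edit Approaches

III. Technical Summary

1 Approaches to Controllability

2 Development Trends

IV. Controllable Generation of ICBU Detail-Page Search Hints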

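Four approaches are currently live in the detail-page search-hint recommendation module: Item2Query, which uses click behavior and TF-IDF to associate each item with its important queries and serves those as hints; Item2Item2Query, which shares hints among similar items matched by main-image similarity; an extractive baseline model that extracts entities from the title according to popular entity types; and our controllable query generation model. Post-launch metrics show that, in terms of CTR, the queries provided by controllable generation rank second only to the queries that Item2Query associates through user behavior.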
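To make the Item2Query idea concrete, here is one possible way to score queries per item with TF-IDF over a click log. The exact formulation used in production is not given in this article, so everything below is an assumption: each item is treated as a "document" whose terms are the queries that led to clicks on it, and `item2query_tfidf` and the sample log are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def item2query_tfidf(
    click_log: List[Tuple[str, str]], top_k: int = 1
) -> Dict[str, List[str]]:
    """click_log holds (query, clicked_item) pairs; returns top queries per item."""
    clicks: Dict[str, Counter] = defaultdict(Counter)
    for query, item in click_log:
        clicks[item][query] += 1

    n_items = len(clicks)
    # Number of distinct items each query led to (the "document frequency").
    item_freq = Counter(q for counts in clicks.values() for q in counts)

    hints: Dict[str, List[str]] = {}
    for item, counts in clicks.items():
        total = sum(counts.values())
        scores = {
            q: (c / total) * math.log(n_items / item_freq[q])
            for q, c in counts.items()
        }
        # Highest score first; ties broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda q: (-scores[q], q))
        hints[item] = ranked[:top_k]
    return hints


log = [
    ("shoes", "A"), ("shoes", "B"),
    ("red shoes", "A"), ("red shoes", "A"),
    ("boots", "B"),
]
hints = item2query_tfidf(log)
```

In this toy log the generic query "shoes" leads to both items, so its IDF is zero and the more specific queries win as hints for each item.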

Figure 10: Online business metrics

V. Datasets for Controllable Text Generation

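StylePTB: a benchmark dataset for fine-grained text style transfer:
https://github.com/lvyiwei1/StylePTB/
SongNet: format-controlled generation of Song ci poetry:
https://github.com/lipiji/SongNet
GPT-2 Output: a large corpus that can be used to build controllable text generation datasets:
https://github.com/openai/gpt-2-output-dataset
Inverse Prompting: open-domain poem generation and open-domain long-form question answering datasets:
https://github.com/THUDM/iPrompt
GYAFC (Grammarly's Yahoo Answers Formality Corpus): a formality transfer corpus built from Yahoo Answers:
https://github.com/raosudha89/GYAFC-corpus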

VI. References

Li, Junyi, et al. "Pretrained Language Models for Text Generation: A Survey." arXiv preprint arXiv:2105.10311 (2021).
Raffel, Colin, et al. "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer." Journal of Machine Learning Research 21 (2020): 1-67.（Original T5 Paper）
Lewis, Mike, et al. "BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension." Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020.（Original BART Paper）
Zhang, Jingqing, et al. "Pegasus: Pre-training with extracted gap-sentences for abstractive summarization." International Conference on Machine Learning. PMLR, 2020.（Original Pegasus Paper）
Radford, Alec, et al. "Language models are unsupervised multitask learners." OpenAI blog 1.8 (2019): 9.（Original GPT-2 Paper）
Zhu, Chenguang, et al. "A Hierarchical Network for Abstractive Meeting Summarization with Cross-Domain Pretraining." Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings. 2020.（Original HMNet Paper）
Liu, Chunyi, et al. "Automatic dialogue summary generation for customer service." Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2019.（Original Leader-Writer network Paper）
Jin, Hanqi, Tianming Wang, and Xiaojun Wan. "Semsum: Semantic dependency guided neural abstractive summarization." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. No. 05. 2020.（Original Semsum Paper）
Zhu, Chenguang, et al. "Boosting factual correctness of abstractive summarization with knowledge graph." arXiv e-prints (2020): arXiv-2003.（Original FASum Paper）
Xia, Yingce, et al. "Deliberation networks: Sequence generation beyond one-pass decoding." Advances in Neural Information Processing Systems 30 (2017): 1784-1794.（Original Deliberation Networks Paper）
Wang, Qingyun, et al. "Paper Abstract Writing through Editing Mechanism." Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2018.（Original Editing Mechanism Paper）
Celikyilmaz, Asli, Elizabeth Clark, and Jianfeng Gao. "Evaluation of text generation: A survey." arXiv preprint arXiv:2006.14799 (2020).
Lin, C. "Recall-oriented understudy for gisting evaluation (rouge)." Retrieved August 20 (2005): 2005.（Original ROUGE Paper）
Papineni, Kishore, et al. "Bleu: a method for automatic evaluation of machine translation." Proceedings of the 40th annual meeting of the Association for Computational Linguistics. 2002.（Original BLEU Paper）
Li, Jiwei, et al. "A Diversity-Promoting Objective Function for Neural Conversation Models." Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2016.（Original Distinct-N Paper）
Zhang, Tianyi, et al. "Bertscore: Evaluating text generation with bert." arXiv preprint arXiv:1904.09675 (2019).（Original BERTScore Paper）
Falke, Tobias, et al. "Ranking generated summaries by correctness: An interesting but challenging application for natural language inference." Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019.（Amazon uses NLI tools to evaluate summarization）
Li, Piji, et al. "Rigid formats controlled text generation." Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020.（Original SongNet Paper）
Lyu, Yiwei, et al. "StylePTB: A Compositional Benchmark for Fine-grained Controllable Text Style Transfer." Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2021.（Original StylePTB Paper）
Keskar, Nitish Shirish, et al. "Ctrl: A conditional transformer language model for controllable generation." arXiv preprint arXiv:1909.05858 (2019).（Original CTRL Paper）
Liu, Pengfei, et al. "Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing." arXiv preprint arXiv:2107.13586 (2021).
Dou, Zi-Yi, et al. "GSum: A General Framework for Guided Neural Abstractive Summarization." Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2021.（Original GSum Paper）
Chan, Alvin, et al. "CoCon: A self-supervised approach for controlled text generation." arXiv preprint arXiv:2006.03535 (2020).（Original CoCon Paper）
Holtzman, Ari, et al. "The curious case of neural text degeneration." arXiv preprint arXiv:1904.09751 (2019).（Original Nucleus Sampling Paper）
Holtzman, Ari, et al. "Learning to Write with Cooperative Discriminators." Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018.（Original L2W Paper）
Krause, Ben, et al. "Gedi: Generative discriminator guided sequence generation." arXiv preprint arXiv:2009.06367 (2020).（Original GeDi Paper）
Dathathri, Sumanth, et al. "Plug and play language models: A simple approach to controlled text generation." arXiv preprint arXiv:1912.02164 (2019).（Original PPLM Paper）
