A Brief Survey of Recent Human-Machine Dialogue Systems - 专知

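With advances in the Internet, information and communication technology, and artificial intelligence, human-machine dialogue systems (conversational systems) offer an inherently natural and convenient way to interact with computing devices. They are widely regarded as the next interaction paradigm after keyboard and mouse input and touchscreens. Industry has already built dialogue technology into many kinds of products and services. Well-known examples include personal assistants such as Apple's Siri, Microsoft's Cortana, Google's Allo, and Baidu's Duer (度秘), as well as Amazon's Echo smart-home service and Alibaba's AliMe (阿里小蜜) e-commerce customer-service assistant. These products have made daily life considerably more convenient and reach hundreds of millions of consumers. Take AliMe as an example: in 2017 it served 340 million Taobao consumers over the year and handled 9.04 million visits on Singles' Day (November 11) alone, with intelligent service accounting for 95% of service volume and an intelligent-service resolution rate of 93.1% [53].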

Figure 1. Mainstream human-machine dialogue products today

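Research on dialogue systems (conversational systems) traces back to the "Turing test" proposed by Alan Turing in 1950 [1]. In the 1960s, the MIT Artificial Intelligence Laboratory built ELIZA, the first chatbot, from a large set of hand-written rules [2]. In recent years, progress in deep learning and natural language processing, together with the growing scale of manually constructed knowledge bases, has produced a large body of research results and methods for dialogue systems [3-18]. Reference [4] describes recent research progress and development trends in detail. Depending on the type of task, dialogue system frameworks can be divided into goal-oriented systems and non-goal-oriented systems, which are introduced in turn below.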

Goal-Oriented Dialogue Systems

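Goal-oriented dialogue systems [4], typified by task-oriented dialogue and question answering, serve a specific user goal. Early goal-oriented systems relied mainly on hand-written rules, for example the DELPHI system [21], which provided flight information to travelers. Starting in the mid-1990s, researchers began to treat the dialogue process as a sequential decision problem [3,23,24] modeled as a Markov decision process (MDP) [22]. Although researchers have pushed the field toward data-driven dialogue systems over the past few decades, most commercial goal-oriented systems today are still strongly domain-specific and rely heavily on hand-crafted features [3]. In terms of method, task-oriented dialogue systems fall into pipeline approaches and end-to-end approaches. The classic pipeline consists of three basic modules: language understanding, dialogue management, and language generation, as shown in Figure 2.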

Figure 2. The classic pipeline framework for human-machine dialogue

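The basic processing flow is as follows. The language understanding module converts the natural language that the user addresses to the machine into a semantic representation [25]. The dialogue management module then uses this semantic representation, the semantic context, and user meta-information to choose a suitable action, and a natural-language reply to the user is generated from that action [26].

The language understanding module identifies the domain and intent and parses the utterance to fill predefined semantic slots. The methods are text classification [27] and sequence labeling, and mainstream models use deep-learning recurrent neural networks (RNN) [28] and variants such as LSTM [29].

Dialogue management comprises dialogue state tracking and dialogue policy learning, and is the core component that keeps a dialogue system robust. Dialogue state tracking estimates the user's goal at every turn: it manages each turn's input together with the dialogue history and outputs the current dialogue state. Traditional methods [30], which are widely used in commercial implementations, usually apply hand-crafted rules to select the most likely result. Rule-based systems, however, are prone to frequent errors and scale poorly [31]. A more recent deep-learning approach uses a sliding window to output a sequence of probability distributions over an arbitrary number of possible values; although trained in one domain, it can be transferred easily to new domains. Widely used models here are the multi-domain RNN [32] and the Neural Belief Tracker (NBT) [33]. Dialogue policy learning takes the state representation from the state tracker and generates the next available system action. Both reinforcement learning and supervised learning can be used to optimize the policy [34].

Finally, natural language generation maps the selected action to a reply. Implementations include symbolic representations [35] and deep-learning LSTM encoder-decoder models [37].

Pipeline-based task-oriented dialogue systems transfer poorly across domains, so building end-to-end task-oriented dialogue systems has become an active research topic. Reference [38] proposes a network-based, end-to-end trainable task-oriented dialogue system that treats dialogue learning as learning a mapping from the dialogue history to the system reply, and trains it with an encoder-decoder model.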

Figure 3. A network-based end-to-end trainable task-oriented dialogue system

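Reference [39] was the first to propose an end-to-end reinforcement learning approach that jointly trains dialogue state tracking and dialogue policy learning within dialogue management, so that the system's actions are optimized more robustly.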
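To make the classic pipeline described in this section concrete, the following is a minimal sketch of its stages: language understanding, dialogue state tracking, a dialogue policy, and language generation. Everything in it is an illustrative assumption: the flight-booking domain, the slot names `origin` and `destination`, the keyword rules, and the reply templates are hypothetical, and hand-written rules stand in for the RNN/LSTM and learned-policy components that the cited systems use.

```python
from dataclasses import dataclass, field

# Hypothetical toy domain: the city list and slot names are assumptions
# made for illustration, not taken from any cited system.
CITIES = {"beijing", "shanghai", "hangzhou"}


def understand(utterance: str) -> dict:
    """Language understanding: map text to an intent plus slot values."""
    tokens = utterance.lower().replace(",", " ").replace(".", " ").split()
    frame = {"intent": "book_flight" if "flight" in tokens else "unknown",
             "slots": {}}
    for i, tok in enumerate(tokens):
        if tok in CITIES and i > 0:
            if tokens[i - 1] == "from":
                frame["slots"]["origin"] = tok
            elif tokens[i - 1] == "to":
                frame["slots"]["destination"] = tok
    return frame


@dataclass
class DialogueState:
    """Dialogue state tracking: accumulate slot values across turns."""
    intent: str = "unknown"
    slots: dict = field(default_factory=dict)

    def update(self, frame: dict) -> None:
        if frame["intent"] != "unknown":
            self.intent = frame["intent"]
        self.slots.update(frame["slots"])


def choose_action(state: DialogueState) -> tuple:
    """Dialogue policy: request the first missing slot, otherwise confirm."""
    for slot in ("origin", "destination"):
        if slot not in state.slots:
            return ("request", slot)
    return ("confirm", None)


def generate(action: tuple, state: DialogueState) -> str:
    """Language generation: template-based realization of the action."""
    kind, slot = action
    if kind == "request":
        return f"Which {slot} city would you like?"
    return (f"Booking a flight from {state.slots['origin'].title()} "
            f"to {state.slots['destination'].title()}.")


def respond(state: DialogueState, utterance: str) -> str:
    """Run one turn through the whole pipeline."""
    state.update(understand(utterance))
    return generate(choose_action(state), state)


state = DialogueState()
first = respond(state, "I need a flight to Shanghai")
second = respond(state, "from Beijing")
print(first)
print(second)
```

The state object is what lets the second turn ("from Beijing") be interpreted correctly even though it carries no intent of its own. In a data-driven system each of these functions would be replaced by a trained model, but the interfaces between the stages stay roughly the same.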

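A question answering (QA) system automatically answers questions posed by users [40,41]. Intelligent QA dates back to the early days of computing in the 1950s and 1960s, with Baseball [42] and Lunar [43] as representative systems. With the rise of Web 2.0, Internet services built on collaboratively generated user content, such as Wikipedia and ODP, produced growing amounts of high-quality data, and on this basis many knowledge bases were constructed automatically or semi-automatically (for example Freebase, YAGO, and DBpedia). In addition, as efficient natural-language analysis techniques matured, industry delivered more practical QA systems, notably Siri and Watson. QA methods can be roughly divided into knowledge-base (symbolic), retrieval-based, and generative approaches.

Knowledge-base QA is a semantic matching process [40]. Representation learning is used to obtain semantic representations of the knowledge base and of the user's question: entities and relations in the knowledge base and the question text are converted into numeric vectors in a low-dimensional semantic space, and numerical computation then directly matches the answer that is semantically most similar to the question.

Retrieval-based QA [44] works in three stages. The question-processing module reduces the natural-language question to a form the machine can handle, through keyword extraction, question classification and expansion, and semantic analysis. The retrieval module uses a retrieval algorithm to find relevant sentences, paragraphs, or documents. The answer-extraction module finds entities in the retrieved results that match the expected answer, ranks the candidates, and selects the most probable one as the final answer.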

Figure 4. Retrieval-based question answering
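The three stages of retrieval-based QA (question processing, retrieval, answer extraction) can be sketched as follows. This is a toy under stated assumptions: the three-sentence corpus, the stopword list, the keyword-overlap retrieval score, and the "multi-word capitalized span means person" heuristic are all hypothetical stand-ins, and the candidate score is a crude heuristic rather than a probability.

```python
import re

# Hypothetical stopword list and corpus, chosen only for illustration.
STOPWORDS = {"the", "a", "an", "of", "in", "at", "was", "is", "by",
             "who", "when", "what", "did"}

PASSAGES = [
    "ELIZA was created by Joseph Weizenbaum at MIT.",
    "Parry was designed by Colby in the 1970s.",
    "The Turing test was proposed by Alan Turing.",
]


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def process_question(question):
    """Question processing: extract keywords and classify the question."""
    tokens = tokenize(question)
    keywords = {t for t in tokens if t not in STOPWORDS}
    qtype = "person" if tokens and tokens[0] == "who" else "other"
    return keywords, qtype


def retrieve(keywords, passages):
    """Retrieval: return the passage sharing the most keywords."""
    return max(passages, key=lambda p: len(keywords & set(tokenize(p))))


def extract_answer(passage, keywords, qtype):
    """Answer extraction: rank capitalized spans not already in the question."""
    candidates = re.findall(r"[A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*", passage)
    candidates = [c for c in candidates if c.lower() not in keywords]

    def score(candidate):
        # Crude type check: treat multi-word capitalized spans as names.
        return 1 if (qtype == "person" and " " in candidate) else 0

    return max(candidates, key=score) if candidates else None


keywords, qtype = process_question("Who created ELIZA?")
passage = retrieve(keywords, PASSAGES)
answer = extract_answer(passage, keywords, qtype)
print(answer)
```

A practical system would replace each heuristic with a stronger component, for example an inverted index for retrieval and a trained named-entity recognizer and ranker for extraction, while keeping the same staged structure.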

Generative methods [45] build on the encoder-decoder framework of sequence-to-sequence learning and produce the answer in a manner similar to machine translation.

Non-Goal-Oriented Dialogue Systems

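Non-goal-oriented dialogue systems, typified by chatbots, serve needs that have no specific goal, such as entertainment or emotional companionship. They date back to the mid-1960s, when Joseph Weizenbaum at the Massachusetts Institute of Technology (MIT) developed ELIZA, the first chatbot [2]. ELIZA used simple natural-language parsing rules to imitate the therapeutic process of a Rogerian psychotherapist. In the 1970s Colby designed another chatbot, Parry [19], which simulated a patient with paranoid schizophrenia. Like ELIZA, Parry conversed on the basis of simple natural-language parsing rules. In later work, Hutchens and Alder [19] began developing a data-driven dialogue system [20]. That system made strong assumptions about its training data, requiring good coverage across topics and fluent text, and it achieved only limited progress over the earlier rule-based systems. In recent years, dialogue systems based on deep neural networks and trained on large corpora have achieved notable performance gains [5,10,17,18].

Chatbots are implemented with either generative or retrieval-based methods. A retrieval-based dialogue system treats response generation as an information retrieval problem: it maintains a fairly large repository of dialogue history and uses information retrieval techniques to produce the response.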

Figure 5. Retrieval-based chat

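For example, the NPCEditor system developed by Anton Leuski et al. trains a statistical text classification model that maps the user's input text to dialogue history stored in a database, which is returned as the response [46]. Retrieval-based methods handle either single-turn or multi-turn response selection. In single-turn response selection, only the current message is used to choose a suitable reply. Recent methods improve the model with deep convolutional neural network architectures, learning representations of the message and the response, or directly learning an interaction representation of the two sentences, and then use a multilayer perceptron to compute the matching score [17,18]. Multi-turn response selection is attracting growing attention. Here the current message and the previous utterances are both taken as input, and the model selects a response that is natural and relevant to the whole context. The key is to identify the important information in previous utterances and to model the relationships among utterances properly so that the conversation stays coherent [16].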

Figure 6. Multi-turn chat
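The difference between single-turn and multi-turn response selection can be illustrated with a small sketch. Note the substitution: instead of the convolutional networks and multilayer perceptron described above, it scores candidates with plain bag-of-words cosine similarity, and it handles context simply by concatenating previous utterances with the current message. The repository contents and the function name `select_response` are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical repository of stored (message, response) pairs.
REPOSITORY = [
    ("is the restaurant good", "The noodles there are excellent."),
    ("is the movie good", "Yes, the plot is gripping."),
]


def bag_of_words(text):
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def select_response(message, context=()):
    """Return the stored response whose message best matches the query.

    With an empty context this is single-turn selection; passing earlier
    utterances in `context` turns it into a naive multi-turn variant.
    """
    query = bag_of_words(" ".join([*context, message]))
    best = max(REPOSITORY,
               key=lambda pair: cosine(query, bag_of_words(pair[0])))
    return best[1]


# Single turn: "it" is ambiguous, both stored messages score the same,
# and the first entry wins only because of its position in the list.
single = select_response("is it good")

# Multi turn: the earlier utterance mentions a movie, which breaks the tie.
multi = select_response("is it good", context=("i watched a movie yesterday",))
print(single)
print(multi)
```

Concatenation is the simplest possible way to use context and treats every earlier word as equally important. The multi-turn models cited above aim to do better by weighting the important parts of earlier utterances and modeling how the utterances relate to one another.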

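Unlike retrieval-based systems, a generative dialogue system produces its response word by word: at each time step it computes a probability distribution over words and samples from it to obtain the output word for that step [47]. Such methods can generate new, higher-probability text sequences from the model's internal state as the system response, which gives finer generation granularity and greater flexibility. A major school of current generative dialogue systems relies on machine translation techniques [48,49].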

Figure 7. Generative chat

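For example, Alan Ritter et al. used a phrase-based statistical machine translation model to generate dialogue responses and compared it against a retrieval-based dialogue system. Human evaluation showed that responses from the machine-translation-based generative model were clearly better than those of the retrieval-based system [50]. References [51,52] study visual dialogue systems built with deep learning and reinforcement learning.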
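The word-by-word generation loop described earlier in this section can be sketched with a toy model. This is an illustration of the decoding step only: a bigram count model over a hypothetical four-sentence corpus stands in for the neural encoder-decoder, and unlike a real response generator it does not condition on the user's message at all. The names `train_bigrams`, `next_word_distribution`, and `generate` are assumptions made for this example.

```python
import random
from collections import Counter, defaultdict

# Hypothetical training corpus with explicit start and end markers.
CORPUS = [
    "<s> how are you </s>",
    "<s> i am fine </s>",
    "<s> i am fine </s>",
    "<s> i am happy </s>",
]


def train_bigrams(corpus):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_distribution(counts, prev):
    """Probability distribution over the next word given the previous one."""
    total = sum(counts[prev].values())
    return {word: c / total for word, c in counts[prev].items()}


def generate(counts, max_len=10, sample=False, rng=None):
    """Emit one word per step until the end marker or max_len is reached."""
    rng = rng or random.Random(0)
    word, output = "<s>", []
    for _ in range(max_len):
        dist = next_word_distribution(counts, word)
        if sample:
            # Sample the next word in proportion to its probability.
            word = rng.choices(list(dist), weights=list(dist.values()))[0]
        else:
            # Greedy decoding: take the most probable next word.
            word = max(dist, key=dist.get)
        if word == "</s>":
            break
        output.append(word)
    return " ".join(output)


counts = train_bigrams(CORPUS)
reply = generate(counts)
print(reply)
```

Passing `sample=True` draws each word from the distribution instead of taking the most probable one, which yields more varied output at the cost of occasionally less likely sequences. An encoder-decoder model follows the same loop but computes each distribution from the encoded input message and all previously generated words.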

References:

[1] Turing A M. Computing Machinery and Intelligence[J]. Mind, 1950, 59: 433–460.

[2] Weizenbaum J. ELIZA - a computer program for the study of natural language communication between man and machine[J]. Commun. ACM, 1966, 9(1): 36–45.

[3] Young S, Gašić M, Thomson B et al. POMDP-based statistical spoken dialog systems: A review[J]. Proceedings of the IEEE, 2013, 101(5): 1160–1179.

[4] Hongshen Chen, Xiaorui Liu, Dawei Yin, and Jiliang Tang. 2017. A Survey on Dialogue Systems: Recent Advances and New Frontiers. SIGKDD Explor. Newsl. 19, 2 (November 2017), 25-35. DOI: https://doi.org/10.1145/3166054.3166058

[5] Zongcheng Ji, Zhengdong Lu, Hang Li: An Information Retrieval Approach to Short Text Conversation [OL]. (2014). arXiv:1408.6988.

[6] Lowe R, Pow N, Serban I, et al. The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems[C]//Proceedings of the SIGDIAL 2015 Conference. 2015: 285-294.

[7] Shang L, Lu Z, Li H. Neural Responding Machine for Short Text Conversation[C]//Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). ACL Press, 2015: 1577-1586.

[8] Serban I V, Sordoni A, Bengio Y, et al. Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models[C]//Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence. AAAI Press, 2016: 3776-3783.

[9] Li J, Galley M, Brockett C, et al. A Persona-Based Neural Conversation Model[C]//Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). ACL Press, 2016: 994-1003.

[10] Feng-Lin Li, Minghui Qiu, Haiqing Chen, Xiongwei Wang, Xing Gao, Jun Huang, Juwei Ren, Zhongzhou Zhao, Weipeng Zhao, Lei Wang, Guwei Jin, Wei Chu: AliMe Assist: An Intelligent Assistant for Creating an Innovative E-commerce Experience. CIKM 2017: 2495-2498

[11] Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C. Courville, and Joelle Pineau. 2016. Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models. In AAAI’16. 3776–3784.

[12] Yiping Song, Rui Yan, Xiang Li, Dongyan Zhao, and Ming Zhang. 2016. Two are Better than One: An Ensemble of Retrieval- and Generation-Based Dialog Systems. arXiv preprint.

[13] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. In NIPS’14. 3104–3112.

[14] Oriol Vinyals and Quoc V. Le. 2015. A Neural Conversational Model. In ICML DL Workshop’15.

[15] Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Lina M Rojas-Barahona, Pei-Hao Su, Stefan Ultes, David Vandyke, and Steve Young. 2016. A network-based end-to-end trainable task-oriented dialogue system. arXiv preprint.

[16] Yu Wu, Wei Wu, Ming Zhou, and Zhoujun Li. 2016. Sequential Match Network: A New Architecture for Multi-turn Response Selection in Retrieval-based Chatbots. arXiv preprint (2016).

[17] Rui Yan, Yiping Song, and Hua Wu. 2016. Learning to Respond with Deep Neural Networks for Retrieval-Based Human-Computer Conversation System. In SIGIR’16. 55–64.

[18] Zhao Yan, Nan Duan, Jun-Wei Bao, Peng Chen, Ming Zhou, Zhoujun Li, and Jianshe Zhou. 2016. DocChat: An Information Retrieval Approach for Chatbot Engines Using Unstructured Documents. In ACL’16.

[19] Colby K M. Modeling a paranoid mind[J]. Behavioral and Brain Sciences, 1981, 4(4): 515–560.

[20] Abu Shawar B, Atwell E. Chatbots: are they really useful?[J]. LDV-Forum: Zeitschrift für Computerlinguistik und Sprachtechnologie, 2007, 22(1): 29–49.

[21] Stallard D, Bobrow R. Fragment processing in the DELPHI system[J]. Proceedings of the workshop on Speech and Natural Language. Association for Computational Linguistics, 1992, 40(2): 305–310.

[22] Bellman R. A Markovian decision process[J]. Journal of Mathematics and Mechanics, 1957, 6: 679–684.

[23] Pieraccini R, Suendermann D, Dayanidhi K et al. Are we there yet? Research in commercial spoken dialog systems[A]. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)[C]. 2009, 5729 LNAI: 3–13.

[24] Singh S, Kearns M, Litman D et al. Reinforcement Learning for Spoken Dialogue Systems[J]. Proceedings of the 13th Annual Conference on Neural Information Processing Systems (NIPS), 1999: 956–962.

[25] Tur G, De Mori R. Spoken language understanding: Systems for extracting semantic information from speech[M]. John Wiley & Sons, 2011.

[26] Mesnil G, Dauphin Y, Yao K, et al. Using recurrent neural networks for slot filling in spoken language understanding[J]. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 2015, 23(3): 530-539.

[27] 张志昌, 张宇, 刘挺, 等. 基于线索词识别和训练集扩展的中文问题分类[J]. 高技术通讯, 2009, 19(2): 111-118.

[28] Yao K, Zweig G, Hwang M, et al. Recurrent neural networks for language understanding[C]//INTERSPEECH-2013. 2013: 2524-2528.

[29] Yao K, Peng B, Zhang Y, et al. Spoken language understanding using long short-term memory neural networks[C]//Proceedings of the Spoken Language Technology Workshop (SLT). IEEE, 2014: 189-194.

[30] D. Goddeau, H. Meng, J. Polifroni, S. Seneff, and S. Busayapongchai. A form-based dialogue manager for spoken language applications. In Spoken Language, 1996. ICSLP 96. Proceedings., Fourth International Conference on, volume 2, pages 701–704. IEEE, 1996.

[31] J. D. Williams. Web-style ranking and SLU combination for dialog state tracking. In SIGDIAL Conference, pages 282–291, 2014.

[32] N. Mrkšić, D. Ó Séaghdha, B. Thomson, M. Gasic, P.-H. Su, D. Vandyke, T.-H. Wen, and S. Young. Multi-domain dialog state tracking using recurrent neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 794–799, Beijing, China, July 2015. Association for Computational Linguistics.

[33] N. Mrkšić, D. Ó Séaghdha, T.-H. Wen, B. Thomson, and S. Young. Neural belief tracker: Data-driven dialogue state tracking. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1777–1788, Vancouver, Canada, July 2017. Association for Computational Linguistics.

[34] H. Cuayáhuitl, S. Keizer, and O. Lemon. Strategic dialogue management via deep reinforcement learning. arxiv.org, 2015.

[35] A. Stent, R. Prasad, and M. Walker. Trainable sentence planning for complex information presentation in spoken dialog systems. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, page 79. Association for Computational Linguistics, 2004.

[36, 37] T.-H. Wen, M. Gasic, N. Mrkšić, P.-H. Su, D. Vandyke, and S. Young. Semantically conditioned LSTM-based natural language generation for spoken dialogue systems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1711–1721, Lisbon, Portugal, September 2015. Association for Computational Linguistics.

[38] T.-H. Wen, D. Vandyke, N. Mrkšić, M. Gasic, L. M. Rojas Barahona, P.-H. Su, S. Ultes, and S. Young. A network-based end-to-end trainable task-oriented dialogue system. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 438–449, Valencia, Spain, April 2017. Association for Computational Linguistics.

[39] T. Zhao and M. Eskenazi. Towards end-to-end learning for dialog state tracking and management using deep reinforcement learning. In Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 1–10, Los Angeles, September 2016. Association for Computational Linguistics.

[40] 刘康, 张元哲, 纪国良, 来斯惟, 赵军. 基于表示学习的知识库问答研究进展与展望. 自动化学报, 2016, 42(6): 807-818.

[41] 郑实福,刘挺,秦兵,李生. 自动问答综述[J].中文信息学报, 2002, 16(6): 47-53.

[42] Green Jr, B. F., Wolf, A. K., Chomsky, C., and Laughery, K. Baseball: an automatic question-answerer. In Papers presented at the May 9-11, 1961, western joint IRE-AIEE-ACM computer conference (1961), ACM, pp. 219–224.

[43] Woods, W. A. Progress in natural language understanding: an application to lunar geology. In Proceedings of the June 4-8, 1973, national computer conference and exposition (1973), ACM, pp. 441–450.

[45] Jun Yin, Xin Jiang, Zhengdong Lu, Lifeng Shang, Hang Li, Xiaoming Li: Neural Generative Question Answering. IJCAI 2016: 2972-2978

[46] Leuski A, Traum D. NPCEditor: A Tool for Building Question-Answering Characters[J]. LREC 2010 - Seventh International Conference on Language Resources and Evaluation, 2010: 2463–2470.

[47] Serban I V, Lowe R, Charlin L et al. A Survey of Available Corpora for Building Data-Driven Dialogue Systems[J]. CoRR, 2015: 46.

[48] Langner B, Vogel S, Black A W. Evaluating a Dialog Language Generation System: Comparing the MOUNTAIN System to Other NLG Approaches[J]. Eleventh Annual Conference of the International Speech Communication Association, 2010(September): 1109–1112.

[49] L. Shang, Z. Lu, and H. Li. Neural responding machine for short-text conversation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1577–1586, Beijing, China, July 2015. Association for Computational Linguistics.

[50] Ritter A, Cherry C, Dolan W B. Data-driven response generation in social media[J]. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’11), 2011: 583–593.

[51] Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M. F. Moura, Devi Parikh, Dhruv Batra: Visual Dialog. CVPR 2017: 1080-1089

[52] Abhishek Das, Satwik Kottur, José M. F. Moura, Stefan Lee, Dhruv Batra: Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning. ICCV 2017: 2970-2979

[53] 阿里小蜜这一年，经历了哪些技术变迁？
