This paper reviews the state of the art of language model architectures and strategies for "complex" question answering (QA, CQA, CPS), with a focus on hybridization. Large language models (LLMs) are good at leveraging public data on standard problems, but tackling more specific, complex questions or problems (e.g., How does the concept of personal freedom vary between cultures? What is the best mix of power generation methods to mitigate climate change?) may require specific architectures, knowledge, skills, methods, sensitive-data protection, explainability, human approval and versatile feedback. Recent projects like ChatGPT and GALACTICA have allowed non-specialists to grasp the great potential as well as the equally strong limitations of LLMs in complex QA. In this paper, we start by reviewing the required skills and evaluation techniques. As a baseline, we integrate findings from the robust, community-edited research papers BIG, BLOOM and HELM, which open-source, benchmark and analyze the limits and challenges of LLMs in terms of task complexity and strict evaluation criteria such as accuracy, fairness, robustness and toxicity. We then discuss challenges associated with complex QA, including domain adaptation, decomposition and efficient multi-step QA, long-form and non-factoid QA, safety and multi-sensitivity data protection, multimodal search, hallucination, explainability and truthfulness, and temporal reasoning. Finally, we analyze current solutions and promising research trends, covering hybrid LLM architectural patterns, training and prompting strategies, human-in-the-loop reinforcement learning supervised with AI, neuro-symbolic and structured knowledge grounding, program synthesis, iterated decomposition, and others.
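To make the notion of iterated decomposition for multi-step QA concrete, the sketch below shows one minimal way a complex question could be split into sub-questions, answered step by step with earlier answers fed back as context, and then recomposed into a final answer. This is an illustrative assumption, not the architecture surveyed in the paper; the `ask_llm` callable is a hypothetical stand-in for any LLM completion endpoint and is stubbed here so the example runs without external services.

```python
# Minimal sketch of iterated decomposition for multi-step QA (assumed workflow,
# not the paper's method). `ask_llm` is a hypothetical placeholder for an LLM call.

from typing import Callable, List


def ask_llm(prompt: str) -> str:
    """Placeholder LLM call; swap in a real completion endpoint."""
    return f"[model answer to: {prompt[:60]}...]"


def decompose(question: str, ask: Callable[[str], str]) -> List[str]:
    """Ask the model to split a complex question into ordered sub-questions."""
    raw = ask(f"List the sub-questions needed to answer: {question}")
    return [line.strip("- ").strip() for line in raw.splitlines() if line.strip()]


def answer_iteratively(question: str, ask: Callable[[str], str]) -> str:
    """Answer each sub-question in turn, feeding prior answers back as context."""
    trace: List[str] = []
    for sub in decompose(question, ask):
        prior = "\n".join(trace)
        step_prompt = f"Context:\n{prior}\n\nAnswer the sub-question: {sub}"
        trace.append(f"{sub} -> {ask(step_prompt)}")
    # Final recomposition step: synthesize one answer from the reasoning trace.
    joined = "\n".join(trace)
    return ask(f"Using these intermediate results:\n{joined}\nGive a final answer to: {question}")


if __name__ == "__main__":
    print(answer_iteratively(
        "What is the best mix of power generation methods to mitigate climate change?",
        ask_llm,
    ))
```

In a real system, the same loop would typically be combined with retrieval, tool calls or symbolic reasoning at each step, which is where the hybrid architectures surveyed here come in.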