Leveraging Test Driven Development with Large Language Models for Reliable and Verifiable Spreadsheet Code Generation: A Research Framework

Large Language Models (LLMs), such as ChatGPT, are increasingly used to generate both traditional software code and spreadsheet logic. Despite their impressive generative capabilities, these models frequently exhibit critical issues such as hallucinations, subtle logical inconsistencies, and syntactic errors. These risks are particularly acute in high-stakes domains like financial modelling and scientific computation, where accuracy and reliability are paramount. This position paper proposes a structured research framework that integrates the proven software engineering practice of Test-Driven Development (TDD) with LLM-driven generation to enhance the correctness and reliability of generated outputs and users' confidence in them. We hypothesise that a "test first" methodology provides both technical constraints and cognitive scaffolding, guiding LLM outputs towards more accurate, verifiable, and comprehensible solutions. Our framework is applicable across diverse programming contexts, from spreadsheet formula generation to scripting languages such as Python and strongly typed languages like Rust. It includes an explicitly outlined experimental design with clearly defined participant groups, evaluation metrics, and illustrative TDD-based prompting examples. By emphasising test-driven thinking, we aim to improve computational thinking, prompt engineering skills, and user engagement, particularly benefiting spreadsheet users who often lack formal programming training yet face serious consequences from logical errors. We invite collaboration to refine and empirically evaluate this approach, ultimately aiming to establish responsible and reliable LLM integration in both educational and professional development practices.

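The "test first" idea in the abstract can be illustrated with a minimal Python sketch. It is not the paper's protocol: the names `TESTS`, `run_tests`, `gross_price`, and `buggy_gross_price` are hypothetical, and the two candidate functions are hand-written stand-ins for LLM-generated code. The tests are written before any implementation and act as the specification that a generated candidate must satisfy. The spreadsheet analogue assumed here is a formula such as `=A2*(1+B2)`.

```python
import math
from typing import Callable, List, Tuple

# Step 1 (test first): state the expected behaviour as input/output pairs
# before any implementation exists. Each case is ((net, rate), expected).
TestCase = Tuple[Tuple[float, float], float]

TESTS: List[TestCase] = [
    ((100.0, 0.20), 120.0),  # net price 100 with a 20% tax rate
    ((0.0, 0.20), 0.0),      # zero net price stays zero
    ((50.0, 0.0), 50.0),     # zero tax rate leaves the price unchanged
]


def run_tests(candidate: Callable[[float, float], float],
              tests: List[TestCase]) -> List[TestCase]:
    """Return the test cases that the candidate fails (empty list = all pass)."""
    failures = []
    for args, expected in tests:
        actual = candidate(*args)
        if not math.isclose(actual, expected, rel_tol=1e-9, abs_tol=1e-9):
            failures.append((args, expected))
    return failures


# Step 2: candidate implementations. In the setting the abstract describes
# these would come from an LLM; here they are hand-written for illustration.
def gross_price(net: float, rate: float) -> float:
    # Spreadsheet analogue (assumed): =A2*(1+B2)
    return net * (1 + rate)


def buggy_gross_price(net: float, rate: float) -> float:
    # A plausible-looking but wrong candidate: adds the rate instead of scaling.
    return net + rate


# Step 3: accept a candidate only if it passes every pre-written test.
if __name__ == "__main__":
    print("correct candidate failures:", run_tests(gross_price, TESTS))
    print("buggy candidate failures:", run_tests(buggy_gross_price, TESTS))
```

In this sketch the failing cases returned by `run_tests` could be fed back into a follow-up prompt, which is one way the tests might serve as both a technical constraint and a scaffold for the user's reasoning.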
