Large Language Models (LLMs), such as ChatGPT, are increasingly used to generate both traditional software code and spreadsheet logic. Despite their impressive generative capabilities, these models frequently hallucinate, introduce subtle logical inconsistencies, and produce syntactic errors. These risks are particularly acute in high-stakes domains such as financial modelling and scientific computing, where accuracy and reliability are paramount. This position paper proposes a structured research framework that integrates the established software engineering practice of Test-Driven Development (TDD) with LLM-driven generation to improve the correctness and reliability of generated outputs and users' confidence in them. We hypothesise that a "test-first" methodology provides both technical constraints and cognitive scaffolding, guiding LLM outputs towards more accurate, verifiable, and comprehensible solutions. Our framework applies across diverse programming contexts, from spreadsheet formula generation to scripting languages such as Python and strongly typed languages such as Rust. It includes an experimental design with clearly defined participant groups and evaluation metrics, together with illustrative TDD-based prompting examples. By emphasising test-driven thinking, we aim to improve computational thinking, prompt engineering skills, and user engagement, particularly for spreadsheet users, who often lack formal programming training yet face serious consequences from logical errors. We invite collaboration to refine and empirically evaluate this approach, with the ultimate aim of establishing responsible and reliable LLM integration in both educational and professional development practices.
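As a rough illustration of the "test-first" idea described above, the following Python sketch writes the tests before any implementation exists and then uses them to accept or reject candidate implementations. It is only a sketch under stated assumptions and not the paper's experimental protocol: the compound-interest task, the names `TEST_CASES`, `passes_all_tests`, `candidate_simple_interest`, and `candidate_compound_interest` are all hypothetical, and the two candidates are hand-written stand-ins for LLM-generated code.

```python
import math
from typing import Callable, List, Tuple

# Step 1 (test first): the specification is written as test cases before any
# implementation exists. Each case is (principal, annual_rate, years, expected).
# The task and the values are illustrative assumptions.
TEST_CASES: List[Tuple[float, float, int, float]] = [
    (1000.0, 0.05, 0, 1000.0),   # zero years leaves the principal unchanged
    (1000.0, 0.00, 10, 1000.0),  # zero rate leaves the principal unchanged
    (1000.0, 0.05, 2, 1102.5),   # 1000 * 1.05 * 1.05
]

Candidate = Callable[[float, float, int], float]


def passes_all_tests(candidate: Candidate) -> bool:
    """Return True only if the candidate satisfies every test case."""
    return all(
        math.isclose(candidate(principal, rate, years), expected, rel_tol=1e-9)
        for principal, rate, years, expected in TEST_CASES
    )


# Step 2: candidate implementations. In a TDD-based prompting workflow these
# would come from an LLM that was given the tests; here they are hand-written
# stand-ins.
def candidate_simple_interest(principal: float, rate: float, years: int) -> float:
    # Plausible-looking but wrong: applies simple rather than compound interest.
    return principal * (1 + rate * years)


def candidate_compound_interest(principal: float, rate: float, years: int) -> float:
    return principal * (1 + rate) ** years


# Step 3: the tests decide which candidate is accepted.
if __name__ == "__main__":
    for fn in (candidate_simple_interest, candidate_compound_interest):
        verdict = "accepted" if passes_all_tests(fn) else "rejected"
        print(f"{fn.__name__}: {verdict}")
```

In this sketch the simple-interest candidate agrees with the first two cases and fails only the third, which is the kind of subtle logical error that tests written up front are meant to surface.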