In business, where data-driven decision making is crucial, text-to-SQL is fundamental for giving users natural language access to structured data. While recent LLMs have achieved strong performance in code generation, existing text-to-SQL benchmarks remain focused on factual retrieval of past records. We introduce CORGI, a new benchmark specifically designed for real-world business contexts. CORGI is composed of synthetic databases inspired by enterprises such as DoorDash, Airbnb, and Lululemon. It provides questions across four increasingly complex categories of business queries: descriptive, explanatory, predictive, and recommendational. The benchmark calls for causal reasoning, temporal forecasting, and strategic recommendation, reflecting multi-level and multi-step agentic intelligence. We find that LLMs perform worse on the higher-level questions, struggling to make accurate predictions and offer actionable plans. Based on execution success rate, the CORGI benchmark is about 21% more difficult than the BIRD benchmark. This highlights the gap between the capabilities of popular LLMs and the demands of real-world business intelligence. We release a public dataset and evaluation framework, and a website for public submissions.