In many use cases, information is stored in text but is not available as structured data. However, extracting data from natural language text so that it precisely fits a schema, and thus enables querying, is a challenging task. With the rise of pre-trained Large Language Models (LLMs), there is now an effective way to store and use information extracted from massive corpora of text documents. We therefore envision using SQL queries to cover a broad range of data that is not captured by traditional databases, by tapping the information in LLMs. To ground this vision, we present Galois, a prototype based on a traditional database architecture, but with new physical operators for querying the underlying LLM. The main idea is to execute some operators of the query plan with prompts that retrieve data from the LLM. For a large class of SQL queries, querying LLMs returns well-structured relations, with encouraging qualitative results. Preliminary experimental results suggest that pre-trained LLMs are a promising addition to the field of database systems, introducing a new direction for hybrid query processing. However, we pinpoint several research challenges that must be addressed to build a DBMS that exploits LLMs. While some of these challenges require integrating concepts from the NLP literature, others offer novel research avenues for the DB community.
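As a rough illustration of the main idea (some operators of the query plan issue prompts, while the rest run as ordinary relational operators), here is a minimal Python sketch. It is not Galois's actual implementation: the operator names `llm_scan` and `llm_fetch_attr`, the prompt wording, and the `fake_llm` stand-in (which returns canned answers instead of calling a real model) are all assumptions made for this example.

```python
from typing import Callable, Dict, Iterable, Iterator, List

# Hypothetical interface: any function that maps a prompt to the model's text reply.
LLM = Callable[[str], str]
Row = Dict[str, str]


def llm_scan(llm: LLM, relation: str, key_attr: str) -> Iterator[Row]:
    """Leaf operator (assumed design): prompt the model to enumerate key values."""
    reply = llm(f"List the {key_attr} of every {relation}, one per line.")
    for line in reply.splitlines():
        value = line.strip()
        if value:
            yield {key_attr: value}


def llm_fetch_attr(llm: LLM, rows: Iterable[Row], relation: str,
                   key_attr: str, attr: str) -> Iterator[Row]:
    """Prompt-based operator (assumed design): one prompt per tuple to fill in an attribute."""
    for row in rows:
        answer = llm(f"What is the {attr} of the {relation} {row[key_attr]}? "
                     "Answer with the value only.")
        yield {**row, attr: answer.strip()}


def select(rows: Iterable[Row], predicate: Callable[[Row], bool]) -> Iterator[Row]:
    """Traditional selection, evaluated locally without touching the model."""
    return (row for row in rows if predicate(row))


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call; returns canned answers for this example only."""
    if prompt.startswith("List"):
        return "France\nJapan\n"
    capitals = {"France": "Paris", "Japan": "Tokyo"}
    for country, capital in capitals.items():
        if country in prompt:
            return capital
    return "unknown"


def run_query(llm: LLM) -> List[Row]:
    """Plan for: SELECT name, capital FROM country WHERE capital = 'Paris'."""
    rows = llm_scan(llm, "country", "name")
    rows = llm_fetch_attr(llm, rows, "country", "name", "capital")
    rows = select(rows, lambda r: r["capital"] == "Paris")
    return list(rows)


if __name__ == "__main__":
    print(run_query(fake_llm))
```

In this sketch only the scan and the attribute lookup touch the model; the selection stays a conventional operator, which is one plausible reading of the hybrid query processing the abstract describes. Swapping `fake_llm` for a real model client would leave the plan unchanged.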