Can foundation models such as ChatGPT clean your data? In this proposal, we demonstrate that ChatGPT can indeed assist in data cleaning by suggesting corrections for specific cells in a data table (scenario 1). However, ChatGPT may struggle with datasets it has never encountered (e.g., local enterprise data) or when the user requires an explanation of where the suggested clean values come from. To address these issues, we developed a retrieval-based method that complements ChatGPT's power with a user-provided data lake: the data lake is first indexed, then the top-k tuples most relevant to the user's query tuple are retrieved, and finally ChatGPT is used to infer the correct value (scenario 2). Nevertheless, sharing enterprise data with ChatGPT, an externally hosted model, might not be feasible for privacy reasons. For this case, we developed a custom RoBERTa-based foundation model that can be deployed locally. After fine-tuning on a small number of examples, it can effectively infer values from the retrieved tuples (scenario 3). Our proposed system, RetClean, seamlessly supports all three scenarios and provides a user-friendly GUI that lets the VLDB audience explore and experiment with the system.
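The index, retrieve, and infer steps of scenario 2 can be illustrated with a minimal Python sketch. It is not RetClean's implementation: the function names (`build_index`, `retrieve_top_k`, `build_prompt`), the Jaccard token-overlap ranking, and the prompt wording are all assumptions made for illustration, and the sketch stops at building the prompt instead of calling any model.

```python
def tokens(row):
    """Lower-cased word tokens from the non-missing attribute values of a tuple."""
    out = set()
    for value in row.values():
        if value is not None:
            out.update(str(value).lower().split())
    return out


def build_index(data_lake):
    """Hypothetical index: pair every data-lake tuple with its token set."""
    return [(row, tokens(row)) for row in data_lake]


def retrieve_top_k(index, query, k):
    """Rank indexed tuples by Jaccard overlap with the query tuple (an assumed metric)."""
    q = tokens(query)

    def jaccard(t):
        union = q | t
        return len(q & t) / len(union) if union else 0.0

    ranked = sorted(index, key=lambda entry: jaccard(entry[1]), reverse=True)
    return [row for row, _ in ranked[:k]]


def build_prompt(query, target, retrieved):
    """Assemble an illustrative prompt; the wording is an assumption, not the paper's."""
    lines = ["Retrieved tuples:"]
    lines += [f"- {row}" for row in retrieved]
    lines.append(f"Query tuple: {query}")
    lines.append(f"Infer the value of attribute '{target}' and cite the tuple used.")
    return "\n".join(lines)


# Toy data lake and a query tuple with a missing 'country' cell.
data_lake = [
    {"name": "Alice Smith", "city": "Doha", "country": "Qatar"},
    {"name": "Bob Jones", "city": "Paris", "country": "France"},
    {"name": "Carol Smith", "city": "Doha", "country": "Qatar"},
]
query = {"name": "Alice Smith", "city": "Doha", "country": None}

index = build_index(data_lake)
top = retrieve_top_k(index, query, k=2)
prompt = build_prompt(query, "country", top)
```

Because the retrieved tuples are included in the prompt, the model's answer can be traced back to a concrete row in the data lake, which is the explanation capability the retrieval step is meant to provide.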