A data product is created to solve a specific problem, address a specific business use case, or meet a particular need, going beyond serving data as a raw asset. Data products enable end users to gain greater insight into their data. Since the concept was first introduced over a decade ago, there has been considerable work, especially in industry, on creating data products manually or semi-automatically. However, hardly any benchmark exists for evaluating automatic data product creation. In this work, we present DP-Bench, the first benchmark of its kind for this task. We describe how the benchmark was created by building on existing work in ELT (Extract-Load-Transform) and Text-to-SQL benchmarks. We also propose a number of LLM-based approaches that can serve as baselines for generating data products automatically. We make DP-Bench and supplementary materials available at https://huggingface.co/datasets/ibm-research/dp-bench.