In the last few years, the concept of the data lake has become popular for data storage and analysis, and several design alternatives have been proposed for building data lake systems. However, these proposals are difficult to evaluate because there are no commonly shared criteria for comparing data lake systems. We therefore introduce DLBench, a benchmark for evaluating and comparing data lake implementations that support textual and/or tabular content. More concretely, we propose a data model made of both textual and raw tabular documents, a workload model composed of a set of various tasks, and a set of performance-based metrics, all relevant to the context of data lakes. As a proof of concept, we use DLBench to evaluate an open-source data lake system that we previously developed.