Traditional data lakes provide critical data infrastructure for analytical workloads by enabling time travel, running SQL queries, ingesting data with ACID transactions, and visualizing petabyte-scale datasets on cloud storage. They allow organizations to break down data silos, unlock data-driven decision-making, improve operational efficiency, and reduce costs. However, as deep learning usage increases, traditional data lakes are not well designed for applications such as natural language processing (NLP), audio processing, and computer vision, or for other applications involving non-tabular datasets. This paper presents Deep Lake, an open-source lakehouse for deep learning applications developed at Activeloop. Deep Lake retains the benefits of a vanilla data lake with one key difference: it stores complex data, such as images, videos, and annotations, as well as tabular data, in the form of tensors and rapidly streams the data over the network to (a) the Tensor Query Language, (b) an in-browser visualization engine, or (c) deep learning frameworks without sacrificing GPU utilization. Datasets stored in Deep Lake can be accessed from PyTorch, TensorFlow, and JAX, and integrate with numerous MLOps tools.
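As a rough illustration of the streaming workflow described above, the sketch below opens a dataset stored in Deep Lake's tensor format and wraps it in a PyTorch dataloader. It assumes the open-source `deeplake` Python package with a v3-style API, the public `hub://activeloop/mnist-train` dataset path, and tensor names `images` and `labels`; exact function and tensor names may differ across library versions.

```python
# Minimal sketch: streaming a Deep Lake dataset into PyTorch.
# Assumptions: `deeplake` (v3-style API) is installed, the public dataset
# "hub://activeloop/mnist-train" is reachable, and it exposes tensors named
# "images" and "labels". Names may vary between versions.
import deeplake

# Lazily open a dataset stored as tensors on cloud storage (no full download).
ds = deeplake.load("hub://activeloop/mnist-train")

# Wrap the dataset in a PyTorch dataloader that streams samples over the
# network in background workers, so the GPU is not left idle during training.
train_loader = ds.pytorch(num_workers=2, batch_size=32, shuffle=True)

for batch in train_loader:
    # Batches are dictionaries keyed by tensor name (assumed layout).
    images, labels = batch["images"], batch["labels"]
    # ... forward/backward pass with any PyTorch model ...
    break
```

The same dataset object can, per the paper, also be queried with the Tensor Query Language or explored in the in-browser visualization engine; the sketch only covers the deep learning framework path.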