Data is now generated at such a high rate that analyzing it efficiently and producing results quickly has become essential. However, relational databases are not well equipped for complex data analysis beyond SQL querying, so data scientists and analysts must work across different systems, and data science frameworks are in high demand. Working across systems can require significant data movement, which is expensive. Furthermore, a relational database requires the data to be fully loaded before any analysis can begin. We believe there is a pressing need for a single system that can perform both data analysis tasks and SQL querying. Ideally, such a system would offer adequate performance, scalability, built-in functionality, and usability. We extend Python's Dask framework to present DaskDB, a scalable data science system that supports unified data analytics and in situ SQL query processing on heterogeneous data sources. DaskDB supports invoking any Python API as a User-Defined Function (UDF), so it can be easily integrated with most existing Python data science applications. Moreover, we introduce a novel distributed learned index to improve join performance. Our experimental evaluation uses the TPC-H benchmark and a custom UDF benchmark that we developed for data analytics. We demonstrate that DaskDB significantly outperforms PySpark and Hive/Hivemall.
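To make the idea of a learned index concrete, the following is a minimal single-node sketch in plain Python. It is not DaskDB's distributed index, whose design the abstract does not describe. The class name `LinearLearnedIndex`, the least-squares linear model, and the bounded binary search are all assumptions chosen for illustration: the model predicts where a key sits in a sorted array, and the recorded worst-case prediction error bounds the window that must be searched.

```python
from bisect import bisect_left


class LinearLearnedIndex:
    """Toy learned index over a sorted list of numeric keys (illustrative only)."""

    def __init__(self, sorted_keys):
        self.keys = list(sorted_keys)
        n = len(self.keys)
        if n == 0:
            raise ValueError("need at least one key")
        # Fit position ~ slope * key + intercept by ordinary least squares.
        mean_k = sum(self.keys) / n
        mean_p = (n - 1) / 2
        var_k = sum((k - mean_k) ** 2 for k in self.keys)
        cov = sum((k - mean_k) * (p - mean_p) for p, k in enumerate(self.keys))
        self.slope = cov / var_k if var_k else 0.0
        self.intercept = mean_p - self.slope * mean_k
        # Record the worst prediction error so lookups can bound their search.
        self.max_err = max(
            abs(self._predict(k) - p) for p, k in enumerate(self.keys)
        )

    def _predict(self, key):
        # Predicted array position for a key (may fall outside the array).
        return round(self.slope * key + self.intercept)

    def lookup(self, key):
        """Return the position of the first occurrence of key, or -1 if absent."""
        n = len(self.keys)
        guess = self._predict(key)
        # Clamp the error window to valid array bounds.
        lo = min(max(0, guess - self.max_err), n)
        hi = max(lo, min(n, guess + self.max_err + 1))
        pos = bisect_left(self.keys, key, lo, hi)
        if pos < n and self.keys[pos] == key:
            return pos
        return -1


def index_join(probe_keys, index):
    """Pair each probe key with its matching position in the indexed keys."""
    matches = []
    for key in probe_keys:
        pos = index.lookup(key)
        if pos != -1:
            matches.append((key, pos))
    return matches
```

In a join, one side's sorted key column would be indexed once and the other side's keys probed against it, as `index_join` does here. When keys are close to evenly spaced, the error window stays small and each probe touches only a few positions.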