Data science has grown rapidly over the last few years as businesses have adopted statistical and machine learning techniques to support their decision making and applications. Scaling data analysis, possibly including the application of custom machine learning models, to large volumes of data requires distributed frameworks, which can pose serious technical challenges for data analysts and reduce their productivity. AFrame, a Python data analytics library implemented as a layer on top of Apache AsterixDB, addresses these issues by incorporating the data scientists' development environment and transparently scaling out the evaluation of analytical operations through a Big Data management system. Although AFrame can leverage data management facilities (e.g., indexes and query optimization) and lets users interact with very large volumes of data, its initial version generated only SQL++ queries and operated only against Apache AsterixDB. In this work, we describe a new design that retargets AFrame's incremental query formation to other query-based database systems, making it more flexible to deploy against other data management systems with composable query languages.