Rumble: 大混乱数据集的数据独立 (Rumble: Data Independence for Large Messy Data Sets) - 专知论文

会员服务 ·

0

Spark SQL · Spark · 相互独立的 · Performer · SQL ·

2020 年 5 月 6 日

Rumble: Data Independence for Large Messy Data Sets

翻译：Rumble: 大混乱数据集的数据独立

Stefan Irimescu,Can Berker Cikis,Ingo Müller,Ghislain Fourny,Gustavo Alonso

from arxiv, Preprint, 9 pages

This paper introduces Rumble, an engine that executes JSONiq queries on large, heterogeneous and nested collections of JSON objects, leveraging the parallel capabilities of Spark so as to provide a high degree of data independence. The design is based on two key insights: (i) how to map JSONiq expressions to Spark transformations on RDDs and (ii) how to map JSONiq FLWOR clauses to Spark SQL on DataFrames. We have developed a working implementation of these mappings showing that JSONiq can efficiently run on Spark to query billions of objects into, at least, the TB range. The JSONiq code is concise in comparison to Spark's host languages while seamlessly supporting the nested, heterogeneous data sets that Spark SQL does not. The ability to process this kind of input, commonly found, is paramount for data cleaning and curation. The experimental analysis indicates that there is no excessive performance loss, occasionally even a gain, over Spark SQL for structured data, and a performance gain over PySpark. This demonstrates that a language such as JSONiq is a simple and viable approach to large-scale querying of denormalized, heterogeneous, arborescent data sets, in the same way as SQL can be leveraged for structured data sets. The results also illustrate that Codd's concept of data independence makes as much sense for heterogeneous, nested data sets as it does on highly structured tables.

翻译：本文介绍Rumble, 这个引擎执行Jsoniq 询问大、多式和嵌套的JSonniq 对象收藏, 利用Spark 的平行能力提供高程度的数据独立性。设计基于两个关键洞察力:(一) 如何将JSoniq 表达式映射成RDDs上的Spark 变形, (二) 如何将JSoniq FLWOR 条款映射成数据框架的SPark SQL 。我们开发了这些绘图的工作性实施方法, 显示JSoniq 可以有效地在Spark上运行, 将数十亿天天天天天天体标到TB 范围。 JSonniq 代码与Spark 的主语言比较简洁简洁, 而Spark SQL 并不完美。通常发现, 处理这类输入的能力对于数据清理和校正。实验分析显示, 与Spoint SQL 一样, 在结构数据系统中, 一种语言可以像高层次数据结构化的解算法一样, 。

0

相关内容

Spark SQL

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

专知会员服务

96+阅读 · 2020年3月12日

【MIT】时间序列GAN，Subadditivity of Probability Divergences

专知会员服务

63+阅读 · 2020年3月4日

【2020新书】Python大数据处理，Mastering Large Datasets with Python，311页pdf

【2020新书】Python大数据处理，Mastering Large Datasets with Python，311页pdf

专知会员服务

197+阅读 · 2020年2月1日

【跨语言BERT模型大集合】Transfer learning is increasingly going multilingual with language-specific BERT models

专知会员服务

54+阅读 · 2020年1月30日

【新书】数据科学编程傻瓜式教程，数据科学编程一体机 | Data Science Programming All-In-One For Dummies

【新书】数据科学编程傻瓜式教程，数据科学编程一体机 | Data Science Programming All-In-One For Dummies

专知会员服务

41+阅读 · 2020年1月22日

【大规模数据系统，552页ppt】Large-scale Data Systems

【大规模数据系统，552页ppt】Large-scale Data Systems

专知会员服务

61+阅读 · 2019年12月21日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

强化学习最新教程，17页pdf

强化学习最新教程，17页pdf

专知会员服务

182+阅读 · 2019年10月11日

机器学习入门的经验与建议

机器学习入门的经验与建议

专知会员服务

94+阅读 · 2019年10月10日

【电子书推荐】Data Science with Python and Dask

【电子书推荐】Data Science with Python and Dask

专知会员服务

44+阅读 · 2019年6月1日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

无监督元学习表示学习

无监督元学习表示学习

CreateAMind

27+阅读 · 2019年1月4日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

meta learning 17年：MAML SNAIL

meta learning 17年：MAML SNAIL

CreateAMind

11+阅读 · 2019年1月2日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

条件GAN重大改进！cGANs with Projection Discriminator

条件GAN重大改进！cGANs with Projection Discriminator

CreateAMind

8+阅读 · 2018年2月7日

【论文】变分推断（Variational inference)的总结

【论文】变分推断（Variational inference)的总结

机器学习研究会

39+阅读 · 2017年11月16日

Adversarial Variational Bayes: Unifying VAE and GAN 代码

Adversarial Variational Bayes: Unifying VAE and GAN 代码

CreateAMind

7+阅读 · 2017年10月4日

Auto-Encoding GAN

Auto-Encoding GAN

CreateAMind

7+阅读 · 2017年8月4日

On Feature Normalization and Data Augmentation

On Feature Normalization and Data Augmentation

Arxiv

15+阅读 · 2020年2月25日

Unsupervised Data Augmentation for Consistency Training

Arxiv

5+阅读 · 2019年7月10日

Deep Learning for Energy Markets

Deep Learning for Energy Markets

Arxiv

10+阅读 · 2019年4月10日

Red blood cell image generation for data augmentation using Conditional Generative Adversarial Networks

Red blood cell image generation for data augmentation using Conditional Generative Adversarial Networks

Arxiv

4+阅读 · 2019年1月18日

An Improved Evaluation Framework for Generative Adversarial Networks

Arxiv

3+阅读 · 2018年3月27日

Activation Maximization Generative Adversarial Nets

Arxiv

5+阅读 · 2018年1月30日

Improved Training of Generative Adversarial Networks Using Representative Features

Arxiv

7+阅读 · 2018年1月28日

Multilingual Topic Models

Arxiv

3+阅读 · 2017年12月18日

A Big Data Analysis Framework Using Apache Spark and Deep Learning

Arxiv

3+阅读 · 2017年11月25日

Multi-Task Learning with Labeled and Unlabeled Tasks

Arxiv

3+阅读 · 2017年6月8日

VIP会员

文章信息

相关主题

相互独立的

相关VIP内容

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

专知会员服务

96+阅读 · 2020年3月12日

【MIT】时间序列GAN，Subadditivity of Probability Divergences

专知会员服务

63+阅读 · 2020年3月4日

【2020新书】Python大数据处理，Mastering Large Datasets with Python，311页pdf

【2020新书】Python大数据处理，Mastering Large Datasets with Python，311页pdf

专知会员服务

197+阅读 · 2020年2月1日

【跨语言BERT模型大集合】Transfer learning is increasingly going multilingual with language-specific BERT models

专知会员服务

54+阅读 · 2020年1月30日

【新书】数据科学编程傻瓜式教程，数据科学编程一体机 | Data Science Programming All-In-One For Dummies

【新书】数据科学编程傻瓜式教程，数据科学编程一体机 | Data Science Programming All-In-One For Dummies

专知会员服务

41+阅读 · 2020年1月22日

【大规模数据系统，552页ppt】Large-scale Data Systems

【大规模数据系统，552页ppt】Large-scale Data Systems

专知会员服务

61+阅读 · 2019年12月21日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

强化学习最新教程，17页pdf

强化学习最新教程，17页pdf

专知会员服务

182+阅读 · 2019年10月11日

机器学习入门的经验与建议

机器学习入门的经验与建议

专知会员服务

94+阅读 · 2019年10月10日

【电子书推荐】Data Science with Python and Dask

【电子书推荐】Data Science with Python and Dask

专知会员服务

44+阅读 · 2019年6月1日

热门VIP内容

开通专知VIP会员享更多权益服务

【博士论文】面向开放式世界的鲁棒智能体

美空军如何利用人工智能提升其兵棋推演能力

【AAAI2026】NeSTR：一种用于大型语言模型的神经-符号可溯因框架，用于时间推理

深度强化学习与模仿学习导论

相关资讯

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

无监督元学习表示学习

无监督元学习表示学习

CreateAMind

27+阅读 · 2019年1月4日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

meta learning 17年：MAML SNAIL

meta learning 17年：MAML SNAIL

CreateAMind

11+阅读 · 2019年1月2日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

条件GAN重大改进！cGANs with Projection Discriminator

条件GAN重大改进！cGANs with Projection Discriminator

CreateAMind

8+阅读 · 2018年2月7日

【论文】变分推断（Variational inference)的总结

【论文】变分推断（Variational inference)的总结

机器学习研究会

39+阅读 · 2017年11月16日

Adversarial Variational Bayes: Unifying VAE and GAN 代码

Adversarial Variational Bayes: Unifying VAE and GAN 代码

CreateAMind

7+阅读 · 2017年10月4日

Auto-Encoding GAN

Auto-Encoding GAN

CreateAMind

7+阅读 · 2017年8月4日

相关论文

On Feature Normalization and Data Augmentation

On Feature Normalization and Data Augmentation

Arxiv

15+阅读 · 2020年2月25日

Unsupervised Data Augmentation for Consistency Training

Arxiv

5+阅读 · 2019年7月10日

Deep Learning for Energy Markets

Deep Learning for Energy Markets

Arxiv

10+阅读 · 2019年4月10日

Red blood cell image generation for data augmentation using Conditional Generative Adversarial Networks

Red blood cell image generation for data augmentation using Conditional Generative Adversarial Networks

Arxiv

4+阅读 · 2019年1月18日

An Improved Evaluation Framework for Generative Adversarial Networks

Arxiv

3+阅读 · 2018年3月27日

Activation Maximization Generative Adversarial Nets

Arxiv

5+阅读 · 2018年1月30日

Improved Training of Generative Adversarial Networks Using Representative Features

Arxiv

7+阅读 · 2018年1月28日

Multilingual Topic Models

Arxiv

3+阅读 · 2017年12月18日

A Big Data Analysis Framework Using Apache Spark and Deep Learning

Arxiv

3+阅读 · 2017年11月25日

Multi-Task Learning with Labeled and Unlabeled Tasks

Arxiv

3+阅读 · 2017年6月8日

微信扫码咨询专知VIP会员