Distributed dataflow systems like Apache Flink and Apache Spark simplify processing large amounts of data on clusters in a data-parallel manner. However, choosing a suitable type and number of cluster resources for a distributed dataflow job is difficult, especially for users who do not have access to performance metrics from previous executions. One approach to overcoming this issue is to have users share runtime metrics to train context-aware performance models that help find a suitable configuration for the job at hand. A problem with sharing runtime data instead of trained models or model parameters is that the data size can grow substantially over time. This paper examines several clustering techniques for minimizing the training data size while keeping the associated performance models accurate. Our results indicate that training data reduction yields efficiency gains in data transfer, storage, and model training. Evaluating our solution on a dataset of runtime data from 930 unique distributed dataflow jobs, we observed that, on average, a 75% data reduction increases prediction errors by only one percentage point.
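To make the idea concrete, the following is a minimal sketch, not the paper's implementation, of clustering-based training data reduction: shared runtime samples are grouped with k-means and only one representative per cluster is kept before a performance model is trained. The synthetic features, the 75% reduction ratio, and the choice of a gradient-boosting regressor are assumptions for illustration; the sketch uses scikit-learn.

```python
# Illustrative sketch of training data reduction via clustering.
# Features, reduction ratio, and model choice are assumptions, not the
# paper's actual pipeline.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Stand-in for shared runtime metrics of 930 jobs: three features
# (e.g., scale-out, memory per node, input size) and the runtime to predict.
X = rng.random((930, 3))
y = X @ np.array([120.0, 60.0, 300.0]) + rng.normal(0.0, 5.0, 930)

# Cluster the samples and keep only the member closest to each centroid,
# shrinking the training set to roughly 25% of its original size.
k = len(X) // 4
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

keep = []
for i, center in enumerate(km.cluster_centers_):
    members = np.flatnonzero(km.labels_ == i)
    keep.append(members[np.argmin(np.linalg.norm(X[members] - center, axis=1))])

# Train the performance model on the reduced dataset only.
model = GradientBoostingRegressor(random_state=0).fit(X[keep], y[keep])
print("predicted runtime for a new configuration:", model.predict(X[:1])[0])
```

Keeping the cluster member closest to each centroid (rather than the centroid itself) ensures every retained training point is a real, observed execution; other representative-selection strategies are possible within the same scheme.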