数据驱动应用的数据质量评分操作框架（DQSOps） (DQSOps: Data Quality Scoring Operations Framework for Data-Driven Applications) - 专知论文

会员服务 ·

0

数据质量 · 数据流 · 数据驱动 · 操作 · 计算加速 ·

2023 年 3 月 27 日

DQSOps: Data Quality Scoring Operations Framework for Data-Driven Applications

翻译：数据驱动应用的数据质量评分操作框架（DQSOps）

Firas Bayram,Bestoun S. Ahmed,Erik Hallin,Anton Engman

from arxiv, 10 Pages The International Conference on Evaluation and Assessment in Software Engineering (EASE) conference

Data quality assessment has become a prominent component in the successful execution of complex data-driven artificial intelligence (AI) software systems. In practice, real-world applications generate huge volumes of data at speeds. These data streams require analysis and preprocessing before being permanently stored or used in a learning task. Therefore, significant attention has been paid to the systematic management and construction of high-quality datasets. Nevertheless, managing voluminous and high-velocity data streams is usually performed manually (i.e. offline), making it an impractical strategy in production environments. To address this challenge, DataOps has emerged to achieve life-cycle automation of data processes using DevOps principles. However, determining the data quality based on a fitness scale constitutes a complex task within the framework of DataOps. This paper presents a novel Data Quality Scoring Operations (DQSOps) framework that yields a quality score for production data in DataOps workflows. The framework incorporates two scoring approaches, an ML prediction-based approach that predicts the data quality score and a standard-based approach that periodically produces the ground-truth scores based on assessing several data quality dimensions. We deploy the DQSOps framework in a real-world industrial use case. The results show that DQSOps achieves significant computational speedup rates compared to the conventional approach of data quality scoring while maintaining high prediction performance.

翻译：数据质量评估已成为成功执行复杂数据驱动人工智能（AI）软件系统的重要组成部分。实践中，实际应用程序以极快的速度生成大量的数据流。这些数据流在永久存储或用于学习任务之前需要进行分析和预处理。因此，在构建高质量数据集时，人们已经对系统管理和建设付出了重要的关注。然而，在生产环境中，管理大量和高速度的数据流通常是手动进行的（即线下操作），这使得它成为不可行的策略。为解决这一挑战，DataOps出现了，以使用DevOps原则实现数据过程的生命周期自动化。然而，在DataOps框架内，根据适应度评分来确定数据质量是一个复杂的任务。本文提出了一种新颖的数据质量评分操作（DQSOps）框架，它为DataOps工作流程中的生产数据产生质量分数。该框架结合了两种评分方法，一种基于机器学习预测的方法，用于预测数据质量分数，另一种基于标准的方法，则周期性地根据评估几个数据质量维度产生基础分数。我们在实际工业用例中部署了DQSOps框架，结果表明，相比传统的数据质量评分方法，DQSOps实现了显著的计算加速率，同时保持了高预测性能。

0

相关内容

数据质量

数据是组织最具价值的资产之一。企业的数据质量与业务绩效之间存在着直接联系，高质量的数据可以使公司保持竞争力并在经济动荡时期立于不败之地。有了普遍深入的数据质量，企业在任何时候都可以信任满足所有需求的所有数据。

【ICDM 2022教程】图挖掘中的公平性:度量、算法和应用

【ICDM 2022教程】图挖掘中的公平性:度量、算法和应用

专知会员服务

28+阅读 · 2022年12月26日

【SIGMOD教程】高效数据标签的众包实践:聚合、增量重标签和定价，附180页slides

【SIGMOD教程】高效数据标签的众包实践:聚合、增量重标签和定价，附180页slides

专知会员服务

11+阅读 · 2022年10月20日

【2022新书】高效深度学习，Efficient Deep Learning Book

【2022新书】高效深度学习，Efficient Deep Learning Book

专知会员服务

125+阅读 · 2022年4月21日

高效可扩展图神经网络的研究进展，Recent Advances in Efficient and Scalable Graph Neural Networks

高效可扩展图神经网络的研究进展，Recent Advances in Efficient and Scalable Graph Neural Networks

专知会员服务

78+阅读 · 2022年3月15日

【牛津大学】电子医疗记录的生成式对抗网络:应用、评估措施和数据来源综述，A review of Generative Adversarial Networks for Electronic Health Records: applications, evaluation measures and data sources

【牛津大学】电子医疗记录的生成式对抗网络:应用、评估措施和数据来源综述，A review of Generative Adversarial Networks for Electronic Health Records: applications, evaluation measures and data sources

专知会员服务

24+阅读 · 2022年3月15日

【干货书】深度学习合成数据，354页pdf，Synthetic Data for Deep Learning

【干货书】深度学习合成数据，354页pdf，Synthetic Data for Deep Learning

专知会员服务

104+阅读 · 2022年2月10日

【干货书】机器学习设计模式，408页pdf，Machine Learning Design Patterns

【干货书】机器学习设计模式，408页pdf，Machine Learning Design Patterns

专知会员服务

138+阅读 · 2022年2月6日

【PKDD2020教程】可解释人工智能XAI:算法到应用，200页ppt

【PKDD2020教程】可解释人工智能XAI:算法到应用，200页ppt

专知会员服务

101+阅读 · 2020年10月13日

2020数据工程师成长路线图

专知会员服务

19+阅读 · 2020年9月6日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

49+阅读 · 2019年10月17日

RoBERTa中文预训练模型：RoBERTa for Chinese

RoBERTa中文预训练模型：RoBERTa for Chinese

PaperWeekly

57+阅读 · 2019年9月16日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

深度自进化聚类：Deep Self-Evolution Clustering

深度自进化聚类：Deep Self-Evolution Clustering

我爱读PAMI

15+阅读 · 2019年4月13日

无监督元学习表示学习

无监督元学习表示学习

CreateAMind

27+阅读 · 2019年1月4日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

vae 相关论文表示学习 1

vae 相关论文表示学习 1

CreateAMind

12+阅读 · 2018年9月6日

【推荐】用TensorFlow实现LSTM社交对话股市情感分析

【推荐】用TensorFlow实现LSTM社交对话股市情感分析

机器学习研究会

11+阅读 · 2018年1月14日

【推荐】用Python/OpenCV实现增强现实

【推荐】用Python/OpenCV实现增强现实

机器学习研究会

15+阅读 · 2017年11月16日

基于社交媒体地理大数据的可感知情境的个性化旅游推荐研究

国家自然科学基金

2+阅读 · 2015年12月31日

基于人眼关注度与情感分析的电子商务智能推荐计算

国家自然科学基金

0+阅读 · 2014年12月31日

大数据环境下基于多源数据协同的个性化服务关键技术研究

国家自然科学基金

3+阅读 · 2014年12月31日

面向数据中心负载的本地存储系统能效优化技术研究

国家自然科学基金

1+阅读 · 2013年12月31日

产品测试和BIST环境下数模混合信号集成电路频谱测试与数据处理的理论和算法研究

国家自然科学基金

0+阅读 · 2013年12月31日

云计算环境下数据库查询验证及数据隐私保护研究

国家自然科学基金

0+阅读 · 2012年12月31日

面向云计算的DSS模型管理机制与模型匹配算法研究

国家自然科学基金

0+阅读 · 2011年12月31日

Reality-based Interaction用户界面模型和评估方法研究

国家自然科学基金

0+阅读 · 2011年12月31日

基于液晶空间光调制器非球面CGH动态检测方法研究

国家自然科学基金

0+阅读 · 2011年12月31日

基于list-mode数据的快速SART真3D PET断层重建算法的研究

国家自然科学基金

0+阅读 · 2011年12月31日

Tracking Different Ant Species: An Unsupervised Domain Adaptation Framework and a Dataset for Multi-object Tracking

Arxiv

0+阅读 · 2023年5月16日

Federated Learning over Harmonized Data Silos

Arxiv

0+阅读 · 2023年5月15日

An Adaptive Ensemble Framework for Addressing Concept Drift in IoT Data Streams

An Adaptive Ensemble Framework for Addressing Concept Drift in IoT Data Streams

Arxiv

0+阅读 · 2023年5月13日

A unified framework for dataset shift diagnostics

Arxiv

0+阅读 · 2023年5月12日

Knowledge Graph Embedding: A Survey from the Perspective of Representation Spaces

Arxiv

18+阅读 · 2022年11月7日

FederatedScope-GNN: Towards a Unified, Comprehensive and Efficient Package for Federated Graph Learning

Arxiv

11+阅读 · 2022年6月27日

Federated Graph Neural Networks: Overview, Techniques and Challenges

Arxiv

16+阅读 · 2022年2月15日

Updating Embeddings for Dynamic Knowledge Graphs

Arxiv

20+阅读 · 2021年9月22日

Recommender systems based on graph embedding techniques: A comprehensive review

Arxiv

23+阅读 · 2021年9月20日

mvn2vec: Preservation and Collaboration in Multi-View Network Embedding

Arxiv

10+阅读 · 2018年1月19日

VIP会员

文章信息

相关主题

相关VIP内容

【ICDM 2022教程】图挖掘中的公平性:度量、算法和应用

【ICDM 2022教程】图挖掘中的公平性:度量、算法和应用

专知会员服务

28+阅读 · 2022年12月26日

【SIGMOD教程】高效数据标签的众包实践:聚合、增量重标签和定价，附180页slides

【SIGMOD教程】高效数据标签的众包实践:聚合、增量重标签和定价，附180页slides

专知会员服务

11+阅读 · 2022年10月20日

【2022新书】高效深度学习，Efficient Deep Learning Book

【2022新书】高效深度学习，Efficient Deep Learning Book

专知会员服务

125+阅读 · 2022年4月21日

高效可扩展图神经网络的研究进展，Recent Advances in Efficient and Scalable Graph Neural Networks

高效可扩展图神经网络的研究进展，Recent Advances in Efficient and Scalable Graph Neural Networks

专知会员服务

78+阅读 · 2022年3月15日

【牛津大学】电子医疗记录的生成式对抗网络:应用、评估措施和数据来源综述，A review of Generative Adversarial Networks for Electronic Health Records: applications, evaluation measures and data sources

【牛津大学】电子医疗记录的生成式对抗网络:应用、评估措施和数据来源综述，A review of Generative Adversarial Networks for Electronic Health Records: applications, evaluation measures and data sources

专知会员服务

24+阅读 · 2022年3月15日

【干货书】深度学习合成数据，354页pdf，Synthetic Data for Deep Learning

【干货书】深度学习合成数据，354页pdf，Synthetic Data for Deep Learning

专知会员服务

104+阅读 · 2022年2月10日

【干货书】机器学习设计模式，408页pdf，Machine Learning Design Patterns

【干货书】机器学习设计模式，408页pdf，Machine Learning Design Patterns

专知会员服务

138+阅读 · 2022年2月6日

【PKDD2020教程】可解释人工智能XAI:算法到应用，200页ppt

【PKDD2020教程】可解释人工智能XAI:算法到应用，200页ppt

专知会员服务

101+阅读 · 2020年10月13日

2020数据工程师成长路线图

专知会员服务

19+阅读 · 2020年9月6日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

49+阅读 · 2019年10月17日

热门VIP内容

开通专知VIP会员享更多权益服务

【博士论文】在低维和高维空间中分析、建模和转换潜在表征

从无人机到数据：揭示边缘计算作为新作战域

可解释人工智能的基础

大规模视觉模型中的基于提示的适应：综述

相关资讯

RoBERTa中文预训练模型：RoBERTa for Chinese

RoBERTa中文预训练模型：RoBERTa for Chinese

PaperWeekly

57+阅读 · 2019年9月16日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

深度自进化聚类：Deep Self-Evolution Clustering

深度自进化聚类：Deep Self-Evolution Clustering

我爱读PAMI

15+阅读 · 2019年4月13日

无监督元学习表示学习

无监督元学习表示学习

CreateAMind

27+阅读 · 2019年1月4日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

vae 相关论文表示学习 1

vae 相关论文表示学习 1

CreateAMind

12+阅读 · 2018年9月6日

【推荐】用TensorFlow实现LSTM社交对话股市情感分析

【推荐】用TensorFlow实现LSTM社交对话股市情感分析

机器学习研究会

11+阅读 · 2018年1月14日

【推荐】用Python/OpenCV实现增强现实

【推荐】用Python/OpenCV实现增强现实

机器学习研究会

15+阅读 · 2017年11月16日

相关论文

Tracking Different Ant Species: An Unsupervised Domain Adaptation Framework and a Dataset for Multi-object Tracking

Arxiv

0+阅读 · 2023年5月16日

Federated Learning over Harmonized Data Silos

Arxiv

0+阅读 · 2023年5月15日

An Adaptive Ensemble Framework for Addressing Concept Drift in IoT Data Streams

An Adaptive Ensemble Framework for Addressing Concept Drift in IoT Data Streams

Arxiv

0+阅读 · 2023年5月13日

A unified framework for dataset shift diagnostics

Arxiv

0+阅读 · 2023年5月12日

Knowledge Graph Embedding: A Survey from the Perspective of Representation Spaces

Arxiv

18+阅读 · 2022年11月7日

FederatedScope-GNN: Towards a Unified, Comprehensive and Efficient Package for Federated Graph Learning

Arxiv

11+阅读 · 2022年6月27日

Federated Graph Neural Networks: Overview, Techniques and Challenges

Arxiv

16+阅读 · 2022年2月15日

Updating Embeddings for Dynamic Knowledge Graphs

Arxiv

20+阅读 · 2021年9月22日

Recommender systems based on graph embedding techniques: A comprehensive review

Arxiv

23+阅读 · 2021年9月20日

mvn2vec: Preservation and Collaboration in Multi-View Network Embedding

Arxiv

10+阅读 · 2018年1月19日

相关基金

基于社交媒体地理大数据的可感知情境的个性化旅游推荐研究

国家自然科学基金

2+阅读 · 2015年12月31日

基于人眼关注度与情感分析的电子商务智能推荐计算

国家自然科学基金

0+阅读 · 2014年12月31日

大数据环境下基于多源数据协同的个性化服务关键技术研究

国家自然科学基金

3+阅读 · 2014年12月31日

面向数据中心负载的本地存储系统能效优化技术研究

国家自然科学基金

1+阅读 · 2013年12月31日

产品测试和BIST环境下数模混合信号集成电路频谱测试与数据处理的理论和算法研究

国家自然科学基金

0+阅读 · 2013年12月31日

云计算环境下数据库查询验证及数据隐私保护研究

国家自然科学基金

0+阅读 · 2012年12月31日

面向云计算的DSS模型管理机制与模型匹配算法研究

国家自然科学基金

0+阅读 · 2011年12月31日

Reality-based Interaction用户界面模型和评估方法研究

国家自然科学基金

0+阅读 · 2011年12月31日

基于液晶空间光调制器非球面CGH动态检测方法研究

国家自然科学基金

0+阅读 · 2011年12月31日

基于list-mode数据的快速SART真3D PET断层重建算法的研究

国家自然科学基金

0+阅读 · 2011年12月31日

微信扫码咨询专知VIP会员