Machine learning is disruptive. At the same time, machine learning can succeed only through collaboration among many parties across multiple steps, which naturally form pipelines in an ecosystem: collecting data for possible machine learning applications, collaboratively training models among multiple parties, and delivering machine learning services to end users. Data is critical and pervades the entire machine learning pipeline. Because machine learning pipelines involve many parties and, to be successful, must form a constructive and dynamic ecosystem, marketplaces and data pricing are fundamental in connecting and facilitating those parties. In this article, we survey the principles and the latest research developments of data pricing in machine learning pipelines. We start with a brief review of data marketplaces and pricing desiderata. Then, we focus on pricing in three important steps of machine learning pipelines. To understand pricing in the training data collection step, we review the pricing of raw data sets and data labels. We also investigate pricing in the collaborative training of machine learning models, and give an overview of pricing machine learning models for end users in the deployment step. Finally, we discuss a series of possible future directions.