Despite data's central role in AI production, it remains the least understood input. As AI labs exhaust public data and turn to proprietary sources, with individual deals reaching hundreds of millions of dollars, research on data has fragmented across computer science, economics, law, and policy. We establish data economics as a coherent field through three contributions. First, we characterize data's distinctive properties (nonrivalry, context dependence, and emergent rivalry through contamination) and trace historical precedents for market formation in commodities such as oil and grain. Second, we systematically document AI training data deals from 2020 to 2025, revealing persistent market fragmentation, five distinct pricing mechanisms (from per-unit licensing to commissioning), and the exclusion of original creators from compensation in most deals. Third, we propose a formal hierarchy of exchangeable data units (token, record, dataset, corpus, stream) and argue for data's explicit representation in production functions. Building on these foundations, we outline four open problems foundational to data economics: measuring context-dependent value, balancing governance with privacy, estimating data's contribution to production, and designing mechanisms for heterogeneous, compositional goods.
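To make the proposed hierarchy concrete, the sketch below shows one way the exchangeable units could be typed, with each level composing the one beneath it. This is a minimal illustration under our own naming assumptions; the class and field names are illustrative, not the paper's formalism.

```python
from dataclasses import dataclass
from typing import Iterator, List

# Illustrative sketch of the token -> record -> dataset -> corpus -> stream
# hierarchy; all class and field names are expository assumptions.

@dataclass(frozen=True)
class Token:
    text: str                   # smallest exchangeable unit

@dataclass(frozen=True)
class Record:
    tokens: List[Token]         # one document, row, or training example

@dataclass
class Dataset:
    records: List[Record]       # a curated, versionable collection

    def size_in_tokens(self) -> int:
        return sum(len(r.tokens) for r in self.records)

@dataclass
class Corpus:
    datasets: List[Dataset]     # the bundle typically licensed in deals

    def size_in_tokens(self) -> int:
        return sum(d.size_in_tokens() for d in self.datasets)

class Stream:
    """An open-ended source of records delivered over time."""
    def __init__(self, source: Iterator[Record]):
        self._source = source

    def __iter__(self) -> Iterator[Record]:
        return self._source
```

Typing the units this way makes the compositionality explicit: plausibly, per-unit licensing attaches prices at the Token or Record level, while commissioning attaches them at the Dataset or Corpus level, over the same underlying units.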
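The argument for data's explicit representation in production functions can likewise be illustrated with a worked form. The Cobb-Douglas specification below is an expository assumption, not the paper's own model:

\[
  Y = A\,K^{\alpha}\,L^{\beta}\,D^{\gamma},
\]

where \(Y\) is output (e.g., model capability), \(K\) compute capital, \(L\) labor, \(D\) a data aggregate drawn from the unit hierarchy above, and \(A\) residual productivity. Treating \(D\) as a distinct factor with its own elasticity \(\gamma\), rather than folding it into \(A\), is what explicit representation amounts to; estimating \(\gamma\) corresponds to the third open problem listed above.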