We consider the problem of linear regression from strategic data sources with a public good component, i.e., when data is provided by strategic agents who seek to minimize an individual provision cost for increasing their data's precision while benefiting from the model's overall precision. In contrast to previous work, our model tackles the case where there is uncertainty about the attributes characterizing the agents' data, a critical aspect of the problem when the number of agents is large. We provide a characterization of the game's equilibrium, which reveals an interesting connection with optimal design. Subsequently, we focus on the asymptotic behavior of the covariance of the linear regression parameters estimated via generalized least squares as the number of data sources becomes large. We provide upper and lower bounds for this covariance matrix and show that, when the agents' provision costs are superlinear, the model's covariance converges to zero, but at a slower rate than in virtually all learning problems with exogenous data. On the other hand, if the agents' provision costs are linear, this covariance fails to converge. This shows that even the basic property of consistency of generalized least squares estimators is compromised when the data sources are strategic.
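
The covariance in question can be sketched as follows. The notation is assumed here for illustration and is not taken from the abstract: agent $i$ holds an attribute vector $x_i$ and reports $y_i = \beta^\top x_i + \varepsilon_i$, with independent zero-mean noise of variance $1/\lambda_i$, where $\lambda_i > 0$ is the precision the agent chooses. Under these assumptions the generalized least squares estimator and its covariance take the standard form below.

```latex
% Assumed notation (illustrative): x_i attributes, \lambda_i chosen precision,
% n number of agents, \beta regression parameters.
\[
  \hat{\beta}_{\mathrm{GLS}}
    = \Bigl(\sum_{i=1}^{n} \lambda_i x_i x_i^\top\Bigr)^{-1}
      \sum_{i=1}^{n} \lambda_i x_i y_i ,
  \qquad
  \operatorname{Cov}\bigl(\hat{\beta}_{\mathrm{GLS}}\bigr)
    = \Bigl(\sum_{i=1}^{n} \lambda_i x_i x_i^\top\Bigr)^{-1} .
\]
% Exogenous benchmark: if every \lambda_i \ge \underline{\lambda} > 0, then
\[
  \operatorname{Cov}\bigl(\hat{\beta}_{\mathrm{GLS}}\bigr)
    \preceq \frac{1}{n\,\underline{\lambda}}
      \Bigl(\frac{1}{n}\sum_{i=1}^{n} x_i x_i^\top\Bigr)^{-1} ,
\]
% which is O(1/n) whenever the averaged matrix converges to a
% positive definite limit.
```

In this sketch, the $O(1/n)$ bound relies on the precisions staying bounded away from zero. When each $\lambda_i$ is chosen strategically, it may shrink as $n$ grows, which is one way to read the slower convergence (or lack of convergence) described above.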