Different statistical samples (e.g., from different locations) offer populations and learning systems observations with distinct statistical properties. Samples under (1) 'Unconfounded' growth preserve a system's ability to determine the independent effects of its individual variables on any outcome of interest (and therefore lead to fair and interpretable black-box predictions). Samples under (2) 'Externally-Valid' growth preserve its ability to make predictions that generalize across out-of-sample variation. The first promotes predictions that generalize over populations, the second over their shared exogenous factors. We illustrate these theoretical patterns in the full American census from 1840 to 1940, with samples ranging from the street level all the way to the national level. This reveals sample requirements for generalizability over space, and new connections among the Shapley value, U-Statistics (Unbiased Statistics), and Hyperbolic Geometry.