We study empirical Bayes estimation of the effect sizes of $N$ units from $K$ noisy observations on each unit. We show that it is possible to achieve near-Bayes optimal mean squared error, without any assumptions or knowledge about the effect size distribution or the noise. The noise distribution can be heteroskedastic and vary arbitrarily from unit to unit. Our proposal, which we call Aurora, leverages the replication inherent in the $K$ observations per unit and recasts the effect size estimation problem as a general regression problem. Aurora with linear regression provably matches the performance of a wide array of estimators including the sample mean, the trimmed mean, the sample median, as well as James-Stein shrunk versions thereof. Aurora automates effect size estimation for Internet-scale datasets, as we demonstrate on data from a large technology firm.
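To make the "recast as a general regression problem" idea concrete, the following is a minimal sketch rather than the authors' implementation. The abstract does not spell out the procedure, so the details here are assumptions: each replicate is held out in turn and used as the regression response, the sorted remaining $K-1$ replicates serve as predictors, and the fitted values are averaged over the $K$ held-out choices. The function name `aurora_linear` and the use of ordinary least squares with an intercept are illustrative choices.

```python
import numpy as np


def aurora_linear(Z):
    """Hypothetical sketch of replicate-based regression for effect sizes.

    Z is an (N, K) array: K noisy observations for each of N units.
    Returns a length-N array of estimated effect sizes.
    """
    Z = np.asarray(Z, dtype=float)
    N, K = Z.shape
    if K < 2:
        raise ValueError("need at least two replicates per unit")

    estimates = np.zeros(N)
    for j in range(K):
        # Assumption: the held-out replicate is the regression response.
        y = Z[:, j]
        # Assumption: predictors are the order statistics of the other replicates.
        X = np.sort(np.delete(Z, j, axis=1), axis=1)
        # Add an intercept column and fit ordinary least squares across units.
        X = np.column_stack([np.ones(N), X])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        # Fitted values estimate the conditional mean of a fresh replicate.
        estimates += X @ coef
    # Average the fitted values over the K held-out choices.
    return estimates / K
```

Calling `aurora_linear(Z)` on an `(N, K)` array returns one estimate per unit. Because the sample mean, trimmed mean, and median of the $K-1$ retained replicates are all linear combinations of their order statistics, a linear fit on sorted replicates can in principle imitate any of them, which is consistent with the matching claim in the abstract.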