Estimating the means of many binomial outcomes is a common problem across applications, such as assessing intergenerational mobility across census tracts, estimating the prevalence of infectious diseases across countries, and measuring click-through rates for different demographic groups. The standard approach is to report the plain sample average for each outcome. Despite its simplicity, this estimate is noisy when the sample sizes or mean parameters are small. In contrast, Empirical Bayes (EB) methods can improve average accuracy by borrowing information across tasks. However, EB methods require a Bayesian model in which the parameters are drawn from a prior distribution that, unlike in the commonly studied Gaussian case, is not identified because binomial measurements are discrete. Even if the prior distribution were known, computation is difficult when the sample sizes are heterogeneous, as there is no simple joint conjugate prior for the sample size and mean parameter. In this paper, we adopt the compound decision framework, which treats the sample sizes and mean parameters as fixed quantities. We develop an approximate Stein's Unbiased Risk Estimator (SURE) for the average mean squared error of any given class of estimators. For a class of machine learning-assisted linear shrinkage estimators, we establish asymptotic optimality, regret bounds, and valid inference. Unlike existing work, we work with the binomials directly without resorting to Gaussian approximations, which allows us to handle small sample sizes and/or small mean parameters in both one-sample and two-sample settings. We demonstrate our approach using three datasets on firm discrimination, education outcomes, and innovation rates.
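To make the idea of risk-tuned linear shrinkage concrete, the following is a minimal Python sketch and not the estimator developed in the paper. It assumes a fixed, user-supplied shrinkage target and a single shrinkage weight shared by all units, and it picks that weight by minimizing an unbiased estimate of the average squared error built directly from binomial moments. The function name `shrink_binomial_means` and its arguments are hypothetical, introduced only for illustration; the approximate SURE and the machine learning-assisted estimator class described above are more general.

```python
def shrink_binomial_means(successes, trials, target):
    """Shrink binomial proportions toward a fixed target (illustrative sketch).

    Each unit i has X_i ~ Binomial(n_i, p_i) with n_i >= 2. The estimate is
    (1 - lam) * X_i / n_i + lam * target, with one weight lam for all units.
    Returns (estimates, lam).
    """
    if any(n < 2 for n in trials):
        raise ValueError("each unit needs at least 2 trials")

    p_hat = [x / n for x, n in zip(successes, trials)]

    # p_hat * (1 - p_hat) / (n - 1) is unbiased for Var(p_hat) = p (1 - p) / n.
    var_hat = [p * (1 - p) / (n - 1) for p, n in zip(p_hat, trials)]

    k = len(p_hat)
    v_bar = sum(var_hat) / k
    d_bar = sum((p - target) ** 2 for p in p_hat) / k

    # For a fixed target, (1 - lam)^2 * v_bar + lam^2 * (d_bar - v_bar) is an
    # unbiased estimate of the average squared error. It is convex in lam with
    # minimizer v_bar / d_bar, which is clipped to [0, 1].
    lam = 0.0 if d_bar == 0 else min(1.0, max(0.0, v_bar / d_bar))

    estimates = [(1 - lam) * p + lam * target for p in p_hat]
    return estimates, lam


if __name__ == "__main__":
    # Hypothetical counts with heterogeneous sample sizes.
    est, lam = shrink_binomial_means([1, 3, 0, 12], [5, 20, 8, 60], target=0.15)
    print(lam, est)
```

If the target is itself computed from the data (for example, a pooled proportion), the risk estimate in this sketch is no longer exactly unbiased and should be read as an approximation.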